[PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs

Linux PCI subsystem development
 help / color / mirror / Atom feed

* [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
       [not found] <20250915072428.1712837-1-vivek.kasireddy@intel.com>
@ 2025-09-15  7:21 ` Vivek Kasireddy
  2025-09-15 15:33   ` Logan Gunthorpe
                     ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Vivek Kasireddy @ 2025-09-15  7:21 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: Vivek Kasireddy, Bjorn Helgaas, Logan Gunthorpe, linux-pci

Typically, functions of the same PCI device (such as a PF and a VF)
share the same bus and have a common root port and the PF provisions
resources for the VF. Given this model, they can be considered
compatible as far as P2PDMA access is considered.

Currently, although the distance (2) is correctly calculated for
functions of the same device, an ACS check failure prevents P2P DMA
access between them. Therefore, introduce a small function named
pci_devfns_support_p2pdma() to determine if the provider and client
belong to the same device and facilitate P2PDMA between them by
not enforcing the ACS check.

However, since it is hard to determine if the device functions of
any given PCI device are P2PDMA compatible, we only relax the ACS
check enforcement for device functions of Intel GPUs. This is
because the P2PDMA communication between the PF and VF of Intel
GPUs is handled internally and does not typically involve the PCIe
fabric.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: <linux-pci@vger.kernel.org>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
v1 -> v2:
- Relax the enforcment of ACS check only for Intel GPU functions
  as they are P2PDMA compatible given the way the PF provisions
  the resources among multiple VFs.

v2 -> v3:
- s/pci_devs_are_p2pdma_compatible/pci_devfns_support_p2pdma
- Improve the commit message to explain the reasoning behind
  relaxing the ACS check enforcement only for Intel GPU functions.

v3 -> v4: (Logan)
- Drop the dev_is_pf() hunk as no special handling is needed for PFs
- Besides the provider, also check to see the client is an Intel GPU
---
 drivers/pci/p2pdma.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index da5657a02007..0a1d884cd0ff 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -544,6 +544,19 @@ static unsigned long map_types_idx(struct pci_dev *client)
 	return (pci_domain_nr(client->bus) << 16) | pci_dev_id(client);
 }
 
+static bool pci_devfns_support_p2pdma(struct pci_dev *provider,
+				      struct pci_dev *client)
+{
+	if (provider->vendor == PCI_VENDOR_ID_INTEL &&
+	    client->vendor == PCI_VENDOR_ID_INTEL) {
+		if ((pci_is_vga(provider) && pci_is_vga(client)) ||
+		    (pci_is_display(provider) && pci_is_display(client)))
+			return pci_physfn(provider) == pci_physfn(client);
+	}
+
+	return false;
+}
+
 /*
  * Calculate the P2PDMA mapping type and distance between two PCI devices.
  *
@@ -643,7 +656,7 @@ calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
 
 	*dist = dist_a + dist_b;
 
-	if (!acs_cnt) {
+	if (!acs_cnt || pci_devfns_support_p2pdma(provider, client)) {
 		map_type = PCI_P2PDMA_MAP_BUS_ADDR;
 		goto done;
 	}
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-15  7:21 ` [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs Vivek Kasireddy
@ 2025-09-15 15:33   ` Logan Gunthorpe
  2025-09-16 17:34   ` Bjorn Helgaas
  2025-09-16 17:57   ` Jason Gunthorpe
  2 siblings, 0 replies; 46+ messages in thread
From: Logan Gunthorpe @ 2025-09-15 15:33 UTC (permalink / raw)
  To: Vivek Kasireddy, dri-devel, intel-xe; +Cc: Bjorn Helgaas, linux-pci



On 2025-09-15 01:21, Vivek Kasireddy wrote:
> Typically, functions of the same PCI device (such as a PF and a VF)
> share the same bus and have a common root port and the PF provisions
> resources for the VF. Given this model, they can be considered
> compatible as far as P2PDMA access is considered.
> 
> Currently, although the distance (2) is correctly calculated for
> functions of the same device, an ACS check failure prevents P2P DMA
> access between them. Therefore, introduce a small function named
> pci_devfns_support_p2pdma() to determine if the provider and client
> belong to the same device and facilitate P2PDMA between them by
> not enforcing the ACS check.
> 
> However, since it is hard to determine if the device functions of
> any given PCI device are P2PDMA compatible, we only relax the ACS
> check enforcement for device functions of Intel GPUs. This is
> because the P2PDMA communication between the PF and VF of Intel
> GPUs is handled internally and does not typically involve the PCIe
> fabric.
> 
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: <linux-pci@vger.kernel.org>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>

This looks good to me, thanks.

Reviewed-by: Logan Gunthorpe <logang@deltatee.com>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-15  7:21 ` [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs Vivek Kasireddy
  2025-09-15 15:33   ` Logan Gunthorpe
@ 2025-09-16 17:34   ` Bjorn Helgaas
  2025-09-16 17:59     ` Jason Gunthorpe
  2025-09-16 17:57   ` Jason Gunthorpe
  2 siblings, 1 reply; 46+ messages in thread
From: Bjorn Helgaas @ 2025-09-16 17:34 UTC (permalink / raw)
  To: Vivek Kasireddy
  Cc: dri-devel, intel-xe, Bjorn Helgaas, Logan Gunthorpe, linux-pci,
	Jason Gunthorpe

[+cc Jason, also doing a lot of ACS work]

On Mon, Sep 15, 2025 at 12:21:05AM -0700, Vivek Kasireddy wrote:
> Typically, functions of the same PCI device (such as a PF and a VF)
> share the same bus and have a common root port and the PF provisions
> resources for the VF. Given this model, they can be considered
> compatible as far as P2PDMA access is considered.

These seem like more than just "typical".  Such devices *always* have
a common Root Port and a PF *always* provisions VF resources.  I guess
it's "typical" or at least common that a PF and VF share the same bus.

> Currently, although the distance (2) is correctly calculated for
> functions of the same device, an ACS check failure prevents P2P DMA
> access between them. Therefore, introduce a small function named
> pci_devfns_support_p2pdma() to determine if the provider and client
> belong to the same device and facilitate P2PDMA between them by
> not enforcing the ACS check.
> 
> However, since it is hard to determine if the device functions of
> any given PCI device are P2PDMA compatible, we only relax the ACS
> check enforcement for device functions of Intel GPUs. This is
> because the P2PDMA communication between the PF and VF of Intel
> GPUs is handled internally and does not typically involve the PCIe
> fabric.
> 
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: <linux-pci@vger.kernel.org>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> ---
> v1 -> v2:
> - Relax the enforcment of ACS check only for Intel GPU functions
>   as they are P2PDMA compatible given the way the PF provisions
>   the resources among multiple VFs.
> 
> v2 -> v3:
> - s/pci_devs_are_p2pdma_compatible/pci_devfns_support_p2pdma
> - Improve the commit message to explain the reasoning behind
>   relaxing the ACS check enforcement only for Intel GPU functions.
> 
> v3 -> v4: (Logan)
> - Drop the dev_is_pf() hunk as no special handling is needed for PFs
> - Besides the provider, also check to see the client is an Intel GPU
> ---
>  drivers/pci/p2pdma.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index da5657a02007..0a1d884cd0ff 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -544,6 +544,19 @@ static unsigned long map_types_idx(struct pci_dev *client)
>  	return (pci_domain_nr(client->bus) << 16) | pci_dev_id(client);
>  }
>  
> +static bool pci_devfns_support_p2pdma(struct pci_dev *provider,
> +				      struct pci_dev *client)
> +{
> +	if (provider->vendor == PCI_VENDOR_ID_INTEL &&
> +	    client->vendor == PCI_VENDOR_ID_INTEL) {
> +		if ((pci_is_vga(provider) && pci_is_vga(client)) ||
> +		    (pci_is_display(provider) && pci_is_display(client)))
> +			return pci_physfn(provider) == pci_physfn(client);
> +	}

I know I've asked this before, but I'm still confused about how this
is related to PCIe r7.0, sec 7.7.12, which says that if an SR-IOV
device implements internal peer-to-peer transactions, ACS is required,
and ACS P2P Egress Control must be supported.

Are you saying that these Intel GPUs don't conform to this?

Or they do, but it's not enough to solve this issue?

Or something else?

Maybe if we add the right comment here, it will keep me from asking
again :)

> +	return false;
> +}
> +
>  /*
>   * Calculate the P2PDMA mapping type and distance between two PCI devices.
>   *
> @@ -643,7 +656,7 @@ calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
>  
>  	*dist = dist_a + dist_b;
>  
> -	if (!acs_cnt) {
> +	if (!acs_cnt || pci_devfns_support_p2pdma(provider, client)) {
>  		map_type = PCI_P2PDMA_MAP_BUS_ADDR;
>  		goto done;
>  	}
> -- 
> 2.50.1
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-16 17:34   ` Bjorn Helgaas
@ 2025-09-16 17:59     ` Jason Gunthorpe
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 17:59 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Vivek Kasireddy, dri-devel, intel-xe, Bjorn Helgaas,
	Logan Gunthorpe, linux-pci

On Tue, Sep 16, 2025 at 12:34:42PM -0500, Bjorn Helgaas wrote:

> I know I've asked this before, but I'm still confused about how this
> is related to PCIe r7.0, sec 7.7.12, which says that if an SR-IOV
> device implements internal peer-to-peer transactions, ACS is required,
> and ACS P2P Egress Control must be supported.

Right, certainly for SRIOV Linux has always taken the view that VFs
and PFs have NO internal loopback if there are no ACS caps.

The entire industry has aligned to this because having an hidden
uncontrolled internal loopback where the VF can reach the PF's BAR
would be catastrophically security broken for virtualization.

Virtualization is the main use case for having SRIOV in the first
place, so nobody would build an insecure internal loopback..

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-15  7:21 ` [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs Vivek Kasireddy
  2025-09-15 15:33   ` Logan Gunthorpe
  2025-09-16 17:34   ` Bjorn Helgaas
@ 2025-09-16 17:57   ` Jason Gunthorpe
  2025-09-18  6:16     ` Kasireddy, Vivek
  2 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 17:57 UTC (permalink / raw)
  To: Vivek Kasireddy
  Cc: dri-devel, intel-xe, Bjorn Helgaas, Logan Gunthorpe, linux-pci

On Mon, Sep 15, 2025 at 12:21:05AM -0700, Vivek Kasireddy wrote:
> Typically, functions of the same PCI device (such as a PF and a VF)
> share the same bus and have a common root port and the PF provisions
> resources for the VF. Given this model, they can be considered
> compatible as far as P2PDMA access is considered.

Huh? I'm not sure I understand what this is about. Please be more
clear what your use case is and what exactly is not working.

If it is talking about internal loopback within a single function
between PF and VF, then no, this is very expressly not something that
should be expected to work by default!

In fact I would consider any SRIOV capable device that had such a
behavior by default to be catastrophically security broken.

So this patch can't be talking about that, right?

Yet that is what this code seems to do?!?!?

> +static bool pci_devfns_support_p2pdma(struct pci_dev *provider,
> +				      struct pci_dev *client)
> +{
> +	if (provider->vendor == PCI_VENDOR_ID_INTEL &&
> +	    client->vendor == PCI_VENDOR_ID_INTEL) {
> +		if ((pci_is_vga(provider) && pci_is_vga(client)) ||
> +		    (pci_is_display(provider) && pci_is_display(client)))
> +			return pci_physfn(provider) == pci_physfn(client);
> +	}

Do not open code quirks like this in random places, if this device
supports some weird ACS behavior and does not include it in the ACS
Caps the right place is to supply an ACS quirk in quirks.c so all the
code knows about the device behavior, including the iommu grouping.

If your device supports P2P between VF and PF then iommu grouping must
put VFs in the PF's group and you loose VFIO support.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-16 17:57   ` Jason Gunthorpe
@ 2025-09-18  6:16     ` Kasireddy, Vivek
  2025-09-18 12:04       ` Jason Gunthorpe
  0 siblings, 1 reply; 46+ messages in thread
From: Kasireddy, Vivek @ 2025-09-18  6:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org

Hi Jason,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
> 
> On Mon, Sep 15, 2025 at 12:21:05AM -0700, Vivek Kasireddy wrote:
> > Typically, functions of the same PCI device (such as a PF and a VF)
> > share the same bus and have a common root port and the PF provisions
> > resources for the VF. Given this model, they can be considered
> > compatible as far as P2PDMA access is considered.
> 
> Huh? I'm not sure I understand what this is about. Please be more
> clear what your use case is and what exactly is not working.
> 
> If it is talking about internal loopback within a single function
> between PF and VF, then no, this is very expressly not something that
> should be expected to work by default!
> 
> In fact I would consider any SRIOV capable device that had such a
> behavior by default to be catastrophically security broken.
> 
> So this patch can't be talking about that, right?
> 
> Yet that is what this code seems to do?!?!?
Here is my use-case:
- Xe Graphics driver, bound to GPU PF on the Host provisions its resources
including VRAM (aka device/graphics memory) among all the VFs.

- A GPU VF device is bound to vfio-pci and assigned to a Linux VM which
is launched via Qemu.

- The Xe Graphics driver running inside the Linux VM creates a buffer in
the VF's portion (or share) of the VRAM and this buffer is shared with
Qemu. Qemu then requests vfio-pci driver to create a dmabuf associated
with this buffer. Note that I am testing with Leon's vfio-pci series included
(vfio/pci: Allow MMIO regions to be exported through dma-buf)

- Next, Qemu requests the GPU PF (via the Xe driver) to import (or access)
the dmabuf (or buffer) located in VF's portion of the VRAM. This is where a
problem occurs. 

The exporter (vfio-pci driver in this case) calls pci_p2pdma_map_type() to
determine the map type between both devices (GPU VF and PF) but it fails
due to the ACS enforcement check.

However, assuming that pci_p2pdma_map_type() did not fail, based on my
experiments, the GPU PF is still unable to access the buffer located in VF's
VRAM portion directly because it is represented using PCI BAR addresses.

The only way this seems to be working at the moment is if the BAR addresses
are translated into VRAM addresses that the GPU PF understands (this is done
done inside Xe driver on the Host using provisioning data). Note that this buffer
is accessible by the CPU using BAR addresses but it is very slow.

So, given that the GPU PF does not need to use PCIe fabric/machinery in order
to access the buffer located in GPU VF's portion of the VRAM in this use-case,
I figured adding a quirk (to not enforce ACS check) here would make sense. 

> 
> > +static bool pci_devfns_support_p2pdma(struct pci_dev *provider,
> > +				      struct pci_dev *client)
> > +{
> > +	if (provider->vendor == PCI_VENDOR_ID_INTEL &&
> > +	    client->vendor == PCI_VENDOR_ID_INTEL) {
> > +		if ((pci_is_vga(provider) && pci_is_vga(client)) ||
> > +		    (pci_is_display(provider) && pci_is_display(client)))
> > +			return pci_physfn(provider) == pci_physfn(client);
> > +	}
> 
> Do not open code quirks like this in random places, if this device
> supports some weird ACS behavior and does not include it in the ACS
> Caps the right place is to supply an ACS quirk in quirks.c so all the
> code knows about the device behavior, including the iommu grouping.
Ok, I'll move it to quirks.c.

> 
> If your device supports P2P between VF and PF then iommu grouping must
> put VFs in the PF's group and you loose VFIO support.
On my test system, it looks like the VFs and the PF are put into different
iommu groups. I am checking with our hardware folks to understand how this
is expected to work but does it mean that P2P between PF and VF is not
supported in my case?

Also, we do need VFIO support. Otherwise, I don't see any other way as to
how a GPU VF device can be assigned (or passthrough'd) to a Guest VM. 

Thanks,
Vivek
> 
> Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-18  6:16     ` Kasireddy, Vivek
@ 2025-09-18 12:04       ` Jason Gunthorpe
  2025-09-19  6:22         ` Kasireddy, Vivek
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-18 12:04 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org

On Thu, Sep 18, 2025 at 06:16:38AM +0000, Kasireddy, Vivek wrote:

> However, assuming that pci_p2pdma_map_type() did not fail, based on my
> experiments, the GPU PF is still unable to access the buffer located in VF's
> VRAM portion directly because it is represented using PCI BAR addresses.

In this case messing with ACS is completely wrong. If the intention is
to convay a some kind of "private" address representing the physical
VRAM then you need to use a DMABUF mechanism to do that, not deliver a
P2P address that the other side cannot access.

Christian told me dmabuf has such a private address mechanism, so
please figure out a way to use it..

> > Do not open code quirks like this in random places, if this device
> > supports some weird ACS behavior and does not include it in the ACS
> > Caps the right place is to supply an ACS quirk in quirks.c so all the
> > code knows about the device behavior, including the iommu grouping.
> Ok, I'll move it to quirks.c.

No, don't, it is completely wrong to mess with ACS flags for the
problem you are trying to solve.

> On my test system, it looks like the VFs and the PF are put into different
> iommu groups. I am checking with our hardware folks to understand how this
> is expected to work but does it mean that P2P between PF and VF is not
> supported in my case?

A special internal path through VRAM is outside the scope of iommu
grouping.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-18 12:04       ` Jason Gunthorpe
@ 2025-09-19  6:22         ` Kasireddy, Vivek
  2025-09-19 12:29           ` Jason Gunthorpe
  2025-09-24 16:13           ` Simon Richter
  0 siblings, 2 replies; 46+ messages in thread
From: Kasireddy, Vivek @ 2025-09-19  6:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org

Hi Jason,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
> 
> On Thu, Sep 18, 2025 at 06:16:38AM +0000, Kasireddy, Vivek wrote:
> 
> > However, assuming that pci_p2pdma_map_type() did not fail, based on
> my
> > experiments, the GPU PF is still unable to access the buffer located in VF's
> > VRAM portion directly because it is represented using PCI BAR addresses.
> 
> In this case messing with ACS is completely wrong. If the intention is
> to convay a some kind of "private" address representing the physical
> VRAM then you need to use a DMABUF mechanism to do that, not deliver a
> P2P address that the other side cannot access.
I think using a PCI BAR Address works just fine in this case because the Xe
driver bound to PF on the Host can easily determine that it belongs to one
of the VFs and translate it into VRAM Address.

> 
> Christian told me dmabuf has such a private address mechanism, so
> please figure out a way to use it..
Even if such as a mechanism exists, we still need a way to prevent
pci_p2pdma_map_type() from failing when invoked by the exporter (vfio-pci).
Does it make sense to move this quirk into the exporter?

Also, AFAICS, translating BAR Address to VRAM Address can only be
done by the Xe driver bound to PF because it has access to provisioning
data. In other words, vfio-pci would not be able to share any other
address other than the BAR Address because it wouldn't know how to
translate it to VRAM Address.

> 
> > > Do not open code quirks like this in random places, if this device
> > > supports some weird ACS behavior and does not include it in the ACS
> > > Caps the right place is to supply an ACS quirk in quirks.c so all the
> > > code knows about the device behavior, including the iommu grouping.
> > Ok, I'll move it to quirks.c.
> 
> No, don't, it is completely wrong to mess with ACS flags for the
> problem you are trying to solve.
But I am not messing with any ACS flags here. I am just adding a quirk to
sidestep the ACS enforcement check given that the PF to VF access does
not involve the PCIe fabric in this case.

> 
> > On my test system, it looks like the VFs and the PF are put into different
> > iommu groups. I am checking with our hardware folks to understand how
> this
> > is expected to work but does it mean that P2P between PF and VF is not
> > supported in my case?
> 
> A special internal path through VRAM is outside the scope of iommu
> grouping.
The BAR to VRAM Address translation logic is being added to the Xe driver
as part of this patch series, so, nothing else needs to be done except a way
to enable the exporter(vfio-pci) to share the dmabuf with the PF without
having to rely on the outcome of pci_p2pdma_map_type() in this use-case.

Thanks,
Vivek

> 
> Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-19  6:22         ` Kasireddy, Vivek
@ 2025-09-19 12:29           ` Jason Gunthorpe
  2025-09-22  6:59             ` Kasireddy, Vivek
  2025-09-24 16:13           ` Simon Richter
  1 sibling, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-19 12:29 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org

On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:
> > In this case messing with ACS is completely wrong. If the intention is
> > to convay a some kind of "private" address representing the physical
> > VRAM then you need to use a DMABUF mechanism to do that, not deliver a
> > P2P address that the other side cannot access.

> I think using a PCI BAR Address works just fine in this case because the Xe
> driver bound to PF on the Host can easily determine that it belongs to one
> of the VFs and translate it into VRAM Address.

That isn't how the P2P or ACS mechansim works in Linux, it is about
the actual address used for DMA.

You can't translate a dma_addr_t to anything in the Xe PF driver
anyhow, once it goes through the IOMMU the necessary information is
lost. This is a fundamentally broken design to dma map something and
then try to reverse engineer the dma_addr_t back to something with
meaning.

> > Christian told me dmabuf has such a private address mechanism, so
> > please figure out a way to use it..
>
> Even if such as a mechanism exists, we still need a way to prevent
> pci_p2pdma_map_type() from failing when invoked by the exporter (vfio-pci).
> Does it make sense to move this quirk into the exporter?

When you export a private address through dmabuf the VFIO exporter
will not call p2pdma paths when generating it.

> Also, AFAICS, translating BAR Address to VRAM Address can only be
> done by the Xe driver bound to PF because it has access to provisioning
> data. In other words, vfio-pci would not be able to share any other
> address other than the BAR Address because it wouldn't know how to
> translate it to VRAM Address.

If you have a vfio varient driver then the VF vfio driver could call
the Xe driver to create a suitable dmabuf using the private
addressing. This is probably what is required here if this is what you
are trying to do.

> > No, don't, it is completely wrong to mess with ACS flags for the
> > problem you are trying to solve.

> But I am not messing with any ACS flags here. I am just adding a quirk to
> sidestep the ACS enforcement check given that the PF to VF access does
> not involve the PCIe fabric in this case.

Which is completely wrong. These are all based on fabric capability,
not based on code in drivers to wrongly "translate" the dma_addr_t.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-19 12:29           ` Jason Gunthorpe
@ 2025-09-22  6:59             ` Kasireddy, Vivek
  2025-09-22 11:22               ` Christian König
  2025-09-22 12:12               ` Jason Gunthorpe
  0 siblings, 2 replies; 46+ messages in thread
From: Kasireddy, Vivek @ 2025-09-22  6:59 UTC (permalink / raw)
  To: Jason Gunthorpe, Christian König, Simona Vetter
  Cc: dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org

Hi Jason,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
> 
> On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:
> > > In this case messing with ACS is completely wrong. If the intention is
> > > to convay a some kind of "private" address representing the physical
> > > VRAM then you need to use a DMABUF mechanism to do that, not
> deliver a
> > > P2P address that the other side cannot access.
> 
> > I think using a PCI BAR Address works just fine in this case because the Xe
> > driver bound to PF on the Host can easily determine that it belongs to one
> > of the VFs and translate it into VRAM Address.
> 
> That isn't how the P2P or ACS mechansim works in Linux, it is about
> the actual address used for DMA.
Right, but this is not dealing with P2P DMA access between two random,
unrelated devices. Instead, this is a special situation involving a GPU PF
trying to access the VRAM of a VF that it provisioned and holds a reference
on (note that the backing object for VF's VRAM is pinned by Xe on Host
as part of resource provisioning). But it gets treated as regular P2P DMA
because the exporters rely on pci_p2pdma_distance() or
pci_p2pdma_map_type() to determine P2P compatibility.

In other words, I am trying to look at this problem differently: how can the
PF be allowed to access the VF's resource that it provisioned, particularly
when the VF itself requests the PF to access it and when a hardware path
(via PCIe fabric) is not required/supported or doesn't exist at all?

Furthermore, note that on a server system with a whitelisted PCIe upstream
bridge, this quirk would not be needed at all as pci_p2pdma_map_type()
would not have failed and this would have been a purely Xe driver specific
problem to solve that would have required just the translation logic and no
further changes anywhere. But my goal is to fix it across systems like
workstations/desktops that do not typically have whitelisted PCIe upstream
bridges.

> 
> You can't translate a dma_addr_t to anything in the Xe PF driver
> anyhow, once it goes through the IOMMU the necessary information is lost.
Well, I already tested this path (via IOMMU, with your earlier vfio-pci +
dmabuf patch that used dma_map_resource() and also with Leon's latest
version) and found that I could still do the translation in the Xe PF driver
after first calling iommu_iova_to_phys().

> This is a fundamentally broken design to dma map something and
> then try to reverse engineer the dma_addr_t back to something with
> meaning.
IIUC, I don't think this is a new or radical idea. I think the concept is slightly
similar to using bounce buffers to address hardware DMA limitations except
that there are no memory copies and the CPU is not involved. And, I don't see
any other way to do this because I don't believe the exporter can provide a
DMA address that the importer can use directly without any translation, which
seems unavoidable in this case.

> 
> > > Christian told me dmabuf has such a private address mechanism, so
> > > please figure out a way to use it..
> >
> > Even if such as a mechanism exists, we still need a way to prevent
> > pci_p2pdma_map_type() from failing when invoked by the exporter (vfio-
> pci).
> > Does it make sense to move this quirk into the exporter?
> 
> When you export a private address through dmabuf the VFIO exporter
> will not call p2pdma paths when generating it.
I have cc'd Christian and Simona. Hopefully, they can help explain how
the dmabuf private address mechanism can be used to address my
use-case. And, I sincerely hope that it will work, otherwise I don't see
any viable path forward for what I am trying to do other than using this
quirk and translation. Note that the main reason why I am doing this
is because I am seeing at-least ~35% performance gain when running
light 3D/Gfx workloads.

> 
> > Also, AFAICS, translating BAR Address to VRAM Address can only be
> > done by the Xe driver bound to PF because it has access to provisioning
> > data. In other words, vfio-pci would not be able to share any other
> > address other than the BAR Address because it wouldn't know how to
> > translate it to VRAM Address.
> 
> If you have a vfio varient driver then the VF vfio driver could call
> the Xe driver to create a suitable dmabuf using the private
> addressing. This is probably what is required here if this is what you
> are trying to do.
Could this not be done via the vendor agnostic vfio-pci (+ dmabuf) driver
instead of having to use a separate VF/vfio variant driver?

> 
> > > No, don't, it is completely wrong to mess with ACS flags for the
> > > problem you are trying to solve.
> 
> > But I am not messing with any ACS flags here. I am just adding a quirk to
> > sidestep the ACS enforcement check given that the PF to VF access does
> > not involve the PCIe fabric in this case.
> 
> Which is completely wrong. These are all based on fabric capability,
> not based on code in drivers to wrongly "translate" the dma_addr_t.
I am not sure why you consider translation to be wrong in this case
given that it is done by a trusted entity (Xe PF driver) that is bound to
the GPU PF and provisioned the resource that it is trying to access.
What limitations do you see with this approach?

Also, the quirk being added in this patch is indeed meant to address a
specific case (GPU PF to VF access) to workaround a potential hardware
limitation (non-existence of a direct PF to VF DMA access path via the
PCIe fabric). Isn't that one of the main ideas behind using quirks -- to
address hardware limitations?

Thanks,
Vivek

> 
> Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22  6:59             ` Kasireddy, Vivek
@ 2025-09-22 11:22               ` Christian König
  2025-09-22 12:20                 ` Jason Gunthorpe
  2025-09-23  6:01                 ` Kasireddy, Vivek
  2025-09-22 12:12               ` Jason Gunthorpe
  1 sibling, 2 replies; 46+ messages in thread
From: Christian König @ 2025-09-22 11:22 UTC (permalink / raw)
  To: Kasireddy, Vivek, Jason Gunthorpe, Simona Vetter
  Cc: dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org

Hi guys,

On 22.09.25 08:59, Kasireddy, Vivek wrote:
> Hi Jason,
> 
>> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
>> functions of Intel GPUs
>>
>> On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:
>>>> In this case messing with ACS is completely wrong. If the intention is
>>>> to convay a some kind of "private" address representing the physical
>>>> VRAM then you need to use a DMABUF mechanism to do that, not
>> deliver a
>>>> P2P address that the other side cannot access.
>>
>>> I think using a PCI BAR Address works just fine in this case because the Xe
>>> driver bound to PF on the Host can easily determine that it belongs to one
>>> of the VFs and translate it into VRAM Address.
>>
>> That isn't how the P2P or ACS mechansim works in Linux, it is about
>> the actual address used for DMA.
> Right, but this is not dealing with P2P DMA access between two random,
> unrelated devices. Instead, this is a special situation involving a GPU PF
> trying to access the VRAM of a VF that it provisioned and holds a reference
> on (note that the backing object for VF's VRAM is pinned by Xe on Host
> as part of resource provisioning). But it gets treated as regular P2P DMA
> because the exporters rely on pci_p2pdma_distance() or
> pci_p2pdma_map_type() to determine P2P compatibility.
> 
> In other words, I am trying to look at this problem differently: how can the
> PF be allowed to access the VF's resource that it provisioned, particularly
> when the VF itself requests the PF to access it and when a hardware path
> (via PCIe fabric) is not required/supported or doesn't exist at all?

Well what exactly is happening here? You have a PF assigned to the host and a VF passed through to a guest, correct?

And now the PF (from the host side) wants to access a BAR of the VF?

Regards,
Christian.

> 
> Furthermore, note that on a server system with a whitelisted PCIe upstream
> bridge, this quirk would not be needed at all as pci_p2pdma_map_type()
> would not have failed and this would have been a purely Xe driver specific
> problem to solve that would have required just the translation logic and no
> further changes anywhere. But my goal is to fix it across systems like
> workstations/desktops that do not typically have whitelisted PCIe upstream
> bridges.
> 
>>
>> You can't translate a dma_addr_t to anything in the Xe PF driver
>> anyhow, once it goes through the IOMMU the necessary information is lost.
> Well, I already tested this path (via IOMMU, with your earlier vfio-pci +
> dmabuf patch that used dma_map_resource() and also with Leon's latest
> version) and found that I could still do the translation in the Xe PF driver
> after first calling iommu_iova_to_phys().
> 
>> This is a fundamentally broken design to dma map something and
>> then try to reverse engineer the dma_addr_t back to something with
>> meaning.
> IIUC, I don't think this is a new or radical idea. I think the concept is slightly
> similar to using bounce buffers to address hardware DMA limitations except
> that there are no memory copies and the CPU is not involved. And, I don't see
> any other way to do this because I don't believe the exporter can provide a
> DMA address that the importer can use directly without any translation, which
> seems unavoidable in this case.
> 
>>
>>>> Christian told me dmabuf has such a private address mechanism, so
>>>> please figure out a way to use it..
>>>
>>> Even if such as a mechanism exists, we still need a way to prevent
>>> pci_p2pdma_map_type() from failing when invoked by the exporter (vfio-
>> pci).
>>> Does it make sense to move this quirk into the exporter?
>>
>> When you export a private address through dmabuf the VFIO exporter
>> will not call p2pdma paths when generating it.
> I have cc'd Christian and Simona. Hopefully, they can help explain how
> the dmabuf private address mechanism can be used to address my
> use-case. And, I sincerely hope that it will work, otherwise I don't see
> any viable path forward for what I am trying to do other than using this
> quirk and translation. Note that the main reason why I am doing this
> is because I am seeing at-least ~35% performance gain when running
> light 3D/Gfx workloads.
> 
>>
>>> Also, AFAICS, translating BAR Address to VRAM Address can only be
>>> done by the Xe driver bound to PF because it has access to provisioning
>>> data. In other words, vfio-pci would not be able to share any other
>>> address other than the BAR Address because it wouldn't know how to
>>> translate it to VRAM Address.
>>
>> If you have a vfio varient driver then the VF vfio driver could call
>> the Xe driver to create a suitable dmabuf using the private
>> addressing. This is probably what is required here if this is what you
>> are trying to do.
> Could this not be done via the vendor agnostic vfio-pci (+ dmabuf) driver
> instead of having to use a separate VF/vfio variant driver?
> 
>>
>>>> No, don't, it is completely wrong to mess with ACS flags for the
>>>> problem you are trying to solve.
>>
>>> But I am not messing with any ACS flags here. I am just adding a quirk to
>>> sidestep the ACS enforcement check given that the PF to VF access does
>>> not involve the PCIe fabric in this case.
>>
>> Which is completely wrong. These are all based on fabric capability,
>> not based on code in drivers to wrongly "translate" the dma_addr_t.
> I am not sure why you consider translation to be wrong in this case
> given that it is done by a trusted entity (Xe PF driver) that is bound to
> the GPU PF and provisioned the resource that it is trying to access.
> What limitations do you see with this approach?
> 
> Also, the quirk being added in this patch is indeed meant to address a
> specific case (GPU PF to VF access) to workaround a potential hardware
> limitation (non-existence of a direct PF to VF DMA access path via the
> PCIe fabric). Isn't that one of the main ideas behind using quirks -- to
> address hardware limitations?
> 
> Thanks,
> Vivek
> 
>>
>> Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22 11:22               ` Christian König
@ 2025-09-22 12:20                 ` Jason Gunthorpe
  2025-09-22 12:25                   ` Christian König
  2025-09-23  5:53                   ` Kasireddy, Vivek
  2025-09-23  6:01                 ` Kasireddy, Vivek
  1 sibling, 2 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-22 12:20 UTC (permalink / raw)
  To: Christian König
  Cc: Kasireddy, Vivek, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On Mon, Sep 22, 2025 at 01:22:49PM +0200, Christian König wrote:

> Well what exactly is happening here? You have a PF assigned to the
> host and a VF passed through to a guest, correct?
>
> And now the PF (from the host side) wants to access a BAR of the VF?

Not quite.

It is a GPU so it has a pool of VRAM. The PF can access all VRAM and
the VF can access some VRAM.

They want to get a DMABUF handle for a bit of the VF's reachable VRAM
that the PF can import and use through it's own funciton.

The use of the VF's BAR in this series is an ugly hack. The PF never
actually uses the VF BAR, it just hackily converts the dma_addr_t back
to CPU physical and figures out where it is in the VRAM pool and then
uses a PF centric address for it.

All they want is either the actual VRAM address or the CPU physical.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22 12:20                 ` Jason Gunthorpe
@ 2025-09-22 12:25                   ` Christian König
  2025-09-22 12:29                     ` Jason Gunthorpe
  2025-09-23  5:53                   ` Kasireddy, Vivek
  1 sibling, 1 reply; 46+ messages in thread
From: Christian König @ 2025-09-22 12:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Kasireddy, Vivek, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On 22.09.25 14:20, Jason Gunthorpe wrote:
> On Mon, Sep 22, 2025 at 01:22:49PM +0200, Christian König wrote:
> 
>> Well what exactly is happening here? You have a PF assigned to the
>> host and a VF passed through to a guest, correct?
>>
>> And now the PF (from the host side) wants to access a BAR of the VF?
> 
> Not quite.
> 
> It is a GPU so it has a pool of VRAM. The PF can access all VRAM and
> the VF can access some VRAM.
> 
> They want to get a DMABUF handle for a bit of the VF's reachable VRAM
> that the PF can import and use through it's own funciton.

Yeah, where's the problem? We do that all the time.

> The use of the VF's BAR in this series is an ugly hack. The PF never
> actually uses the VF BAR, it just hackily converts the dma_addr_t back
> to CPU physical and figures out where it is in the VRAM pool and then
> uses a PF centric address for it.

Oh my, please absolutely don't do that!

If you have a device internal connection then you need special handling between your PF and VF driver.

But I still don't have the full picture of who is using what here, e.g. who is providing the DMA-buf handle and who is using it?

Regards,
Christian.

> 
> All they want is either the actual VRAM address or the CPU physical.
> 
> Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22 12:25                   ` Christian König
@ 2025-09-22 12:29                     ` Jason Gunthorpe
  2025-09-22 13:20                       ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-22 12:29 UTC (permalink / raw)
  To: Christian König
  Cc: Kasireddy, Vivek, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On Mon, Sep 22, 2025 at 02:25:15PM +0200, Christian König wrote:
> On 22.09.25 14:20, Jason Gunthorpe wrote:
> > On Mon, Sep 22, 2025 at 01:22:49PM +0200, Christian König wrote:
> > 
> >> Well what exactly is happening here? You have a PF assigned to the
> >> host and a VF passed through to a guest, correct?
> >>
> >> And now the PF (from the host side) wants to access a BAR of the VF?
> > 
> > Not quite.
> > 
> > It is a GPU so it has a pool of VRAM. The PF can access all VRAM and
> > the VF can access some VRAM.
> > 
> > They want to get a DMABUF handle for a bit of the VF's reachable VRAM
> > that the PF can import and use through it's own funciton.
> 
> Yeah, where's the problem? We do that all the time.

Well this is the problem:

> > The use of the VF's BAR in this series is an ugly hack. The PF never
> > actually uses the VF BAR, it just hackily converts the dma_addr_t back
> > to CPU physical and figures out where it is in the VRAM pool and then
> > uses a PF centric address for it.
> 
> Oh my, please absolutely don't do that!

Great.

> If you have a device internal connection then you need special
> handling between your PF and VF driver.

Generic VFIO can't do something like that, so they would need to make
a special variant VFIO driver.

> But I still don't have the full picture of who is using what here,
> e.g. who is providing the DMA-buf handle and who is using it?

Generic VFIO is the exporer, the xe driver is the importer.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22 12:29                     ` Jason Gunthorpe
@ 2025-09-22 13:20                       ` Christian König
  2025-09-22 13:27                         ` Jason Gunthorpe
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2025-09-22 13:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Kasireddy, Vivek, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On 22.09.25 14:29, Jason Gunthorpe wrote:
> On Mon, Sep 22, 2025 at 02:25:15PM +0200, Christian König wrote:
>> On 22.09.25 14:20, Jason Gunthorpe wrote:
>>> On Mon, Sep 22, 2025 at 01:22:49PM +0200, Christian König wrote:
>>>
>>>> Well what exactly is happening here? You have a PF assigned to the
>>>> host and a VF passed through to a guest, correct?
>>>>
>>>> And now the PF (from the host side) wants to access a BAR of the VF?
>>>
>>> Not quite.
>>>
>>> It is a GPU so it has a pool of VRAM. The PF can access all VRAM and
>>> the VF can access some VRAM.
>>>
>>> They want to get a DMABUF handle for a bit of the VF's reachable VRAM
>>> that the PF can import and use through it's own funciton.
>>
>> Yeah, where's the problem? We do that all the time.
> 
> Well this is the problem:
> 
>>> The use of the VF's BAR in this series is an ugly hack. The PF never
>>> actually uses the VF BAR, it just hackily converts the dma_addr_t back
>>> to CPU physical and figures out where it is in the VRAM pool and then
>>> uses a PF centric address for it.
>>
>> Oh my, please absolutely don't do that!
> 
> Great.
> 
>> If you have a device internal connection then you need special
>> handling between your PF and VF driver.
> 
> Generic VFIO can't do something like that, so they would need to make
> a special variant VFIO driver.
>
>> But I still don't have the full picture of who is using what here,
>> e.g. who is providing the DMA-buf handle and who is using it?
> 
> Generic VFIO is the exporer, the xe driver is the importer.

Why in the world is the exporter the generic VFIO driver?

At least on AMD GPUs when you want to have a DMA-buf for a specific part of the VFs resources then you ask the hypervisor driver managing the PF for that and not the VFIO driver.

Background is that the VFIO only sees the BARs of the VF and that in turn is usually only a window giving access to a fraction of the resources assigned to the VF. In other words VF BAR is 256MiB in size while 4GiB of VRAM is assigned to the VF (for example).

With that design you don't run into issues with PCIe P2P in the first place. On the other hand when you want to do PCIe P2P to the VF BAR than asking the VFIO driver does make perfect sense, but that doesn't seem to be the use case here.

Regards,
Christian.

> 
> Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22 13:20                       ` Christian König
@ 2025-09-22 13:27                         ` Jason Gunthorpe
  2025-09-22 13:57                           ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-22 13:27 UTC (permalink / raw)
  To: Christian König
  Cc: Kasireddy, Vivek, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On Mon, Sep 22, 2025 at 03:20:49PM +0200, Christian König wrote:

> At least on AMD GPUs when you want to have a DMA-buf for a specific
> part of the VFs resources then you ask the hypervisor driver
> managing the PF for that and not the VFIO driver.

Having a UAPI on the PF to give DMABUFs for arbitary VRAM seems
security exciting. I'd probably want to insist the calling process
prove to the PF driver that it also has access to the VF.

Having the VF create the DMABUF is one way to do that, but I imagine
there are other options.

But sure, if you can solve, or ignore, the security proof it makes a
lot more sense.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22 13:27                         ` Jason Gunthorpe
@ 2025-09-22 13:57                           ` Christian König
  2025-09-22 14:00                             ` Jason Gunthorpe
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2025-09-22 13:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Kasireddy, Vivek, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On 22.09.25 15:27, Jason Gunthorpe wrote:
> On Mon, Sep 22, 2025 at 03:20:49PM +0200, Christian König wrote:
> 
>> At least on AMD GPUs when you want to have a DMA-buf for a specific
>> part of the VFs resources then you ask the hypervisor driver
>> managing the PF for that and not the VFIO driver.
> 
> Having a UAPI on the PF to give DMABUFs for arbitary VRAM seems
> security exciting. I'd probably want to insist the calling process
> prove to the PF driver that it also has access to the VF.

Good point. In our use case it's the userspace hypervisor component (running as root) talking to the kernel hypervisor driver and it is only used to collect very specific crash information.

> Having the VF create the DMABUF is one way to do that, but I imagine
> there are other options.

Well how does the guest communicate to the host which parts of the VRAM should be exposed as DMA-buf in the first place?

> 
> But sure, if you can solve, or ignore, the security proof it makes a
> lot more sense.

I mean what I'm still missing is the whole picture. E.g. what is the reason why you want a DMA-buf of portions of the VFs resource on the host in the first place?

Is that for postmortem crash analysis? Providing some kind of service to the guest? Something completely different?

Regards,
Christian.

> 
> Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22 13:57                           ` Christian König
@ 2025-09-22 14:00                             ` Jason Gunthorpe
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-22 14:00 UTC (permalink / raw)
  To: Christian König
  Cc: Kasireddy, Vivek, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On Mon, Sep 22, 2025 at 03:57:02PM +0200, Christian König wrote:

> Is that for postmortem crash analysis? Providing some kind of
> service to the guest? Something completely different?

From the cover letter:

 With this patch series applied, it would become possible to display
 (via Qemu GTK UI) Guest VM compositor's framebuffer (created in its
 LMEM) on the Host without having to make any copies of it or a
 costly roundtrip to System RAM. And, weston-simple-egl can now
 achieve ~59 FPS while running with Gnome Wayland in the Guest VM.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22 12:20                 ` Jason Gunthorpe
  2025-09-22 12:25                   ` Christian König
@ 2025-09-23  5:53                   ` Kasireddy, Vivek
  2025-09-23  6:25                     ` Matthew Brost
  1 sibling, 1 reply; 46+ messages in thread
From: Kasireddy, Vivek @ 2025-09-23  5:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Christian König
  Cc: Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org, Thomas Hellström, Brost, Matthew

Hi Jason,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
> 
> On Mon, Sep 22, 2025 at 01:22:49PM +0200, Christian König wrote:
> 
> > Well what exactly is happening here? You have a PF assigned to the
> > host and a VF passed through to a guest, correct?
> >
> > And now the PF (from the host side) wants to access a BAR of the VF?
> 
> Not quite.
> 
> It is a GPU so it has a pool of VRAM. The PF can access all VRAM and
> the VF can access some VRAM.
> 
> They want to get a DMABUF handle for a bit of the VF's reachable VRAM
> that the PF can import and use through it's own funciton.
> 
> The use of the VF's BAR in this series is an ugly hack.
IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
to never expose VRAM Addresses and instead have BAR addresses as DMA
addresses when exporting dmabufs to other devices. Here is the relevant code
snippet in Xe:
                phys_addr_t phys = cursor.start + xe_vram_region_io_start(tile->mem.vram);             
                size_t size = min_t(u64, cursor.size, SZ_2G);                         
                dma_addr_t addr;                                                      
                                                                                      
                addr = dma_map_resource(dev, phys, size, dir,                         
                                        DMA_ATTR_SKIP_CPU_SYNC);

And, here is the one in amdgpu:
        for_each_sgtable_sg((*sgt), sg, i) {
                phys_addr_t phys = cursor.start + adev->gmc.aper_base;
                unsigned long size = min(cursor.size, AMDGPU_MAX_SG_SEGMENT_SIZE);
                dma_addr_t addr;

                addr = dma_map_resource(dev, phys, size, dir,
                                        DMA_ATTR_SKIP_CPU_SYNC);

And, AFAICS, most of these drivers don't see use the BAR addresses directly
if they import a dmabuf that they exported earlier and instead do this:

        if (dma_buf->ops == &xe_dmabuf_ops) {
                obj = dma_buf->priv;
                if (obj->dev == dev &&
                    !XE_TEST_ONLY(test && test->force_different_devices)) {
                        /*
                         * Importing dmabuf exported from out own gem increases
                         * refcount on gem itself instead of f_count of dmabuf.
                         */
                        drm_gem_object_get(obj);
                        return obj;
                }
        }

>The PF never actually uses the VF BAR
That's because the PF can't use it directly, most likely due to hardware limitations.

>it just hackily converts the dma_addr_t back
> to CPU physical and figures out where it is in the VRAM pool and then
> uses a PF centric address for it.
> 
> All they want is either the actual VRAM address or the CPU physical.
The problem here is that the CPU physical (aka BAR Address) is only
usable by the CPU. Since the GPU PF only understands VRAM addresses,
the current exporter (vfio-pci) or any VF/VFIO variant driver cannot provide
the VRAM addresses that the GPU PF can use directly because they do not
have access to the provisioning data.

However, it is possible that if vfio-pci or a VF/VFIO variant driver had access
to the VF's provisioning data, then it might be able to create a dmabuf with
VRAM addresses that the PF can use directly. But I am not sure if exposing
provisioning data to VFIO drivers is ok from a security standpoint or not.

Thanks,
Vivek

> 
> Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23  5:53                   ` Kasireddy, Vivek
@ 2025-09-23  6:25                     ` Matthew Brost
  2025-09-23  6:44                       ` Matthew Brost
  0 siblings, 1 reply; 46+ messages in thread
From: Matthew Brost @ 2025-09-23  6:25 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: Jason Gunthorpe, Christian König, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On Mon, Sep 22, 2025 at 11:53:06PM -0600, Kasireddy, Vivek wrote:
> Hi Jason,
> 
> > Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> > functions of Intel GPUs
> > 
> > On Mon, Sep 22, 2025 at 01:22:49PM +0200, Christian König wrote:
> > 
> > > Well what exactly is happening here? You have a PF assigned to the
> > > host and a VF passed through to a guest, correct?
> > >
> > > And now the PF (from the host side) wants to access a BAR of the VF?
> > 
> > Not quite.
> > 
> > It is a GPU so it has a pool of VRAM. The PF can access all VRAM and
> > the VF can access some VRAM.
> > 
> > They want to get a DMABUF handle for a bit of the VF's reachable VRAM
> > that the PF can import and use through it's own funciton.
> > 
> > The use of the VF's BAR in this series is an ugly hack.
> IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
> to never expose VRAM Addresses and instead have BAR addresses as DMA
> addresses when exporting dmabufs to other devices. Here is the relevant code
> snippet in Xe:
>                 phys_addr_t phys = cursor.start + xe_vram_region_io_start(tile->mem.vram);             
>                 size_t size = min_t(u64, cursor.size, SZ_2G);                         
>                 dma_addr_t addr;                                                      
>                                                                                       
>                 addr = dma_map_resource(dev, phys, size, dir,                         
>                                         DMA_ATTR_SKIP_CPU_SYNC);
> 
> And, here is the one in amdgpu:
>         for_each_sgtable_sg((*sgt), sg, i) {
>                 phys_addr_t phys = cursor.start + adev->gmc.aper_base;
>                 unsigned long size = min(cursor.size, AMDGPU_MAX_SG_SEGMENT_SIZE);
>                 dma_addr_t addr;
> 
>                 addr = dma_map_resource(dev, phys, size, dir,
>                                         DMA_ATTR_SKIP_CPU_SYNC);
> 

I've read through this thread—Jason, correct me if I'm wrong—but I
believe what you're suggesting is that instead of using PCIe P2P
(dma_map_resource) to communicate the VF's VRAM offset to the PF, we
should teach dma-buf to natively understand a VF's VRAM offset. I don't
think this is currently built into dma-buf, but it probably should be,
as it could benefit other use cases as well (e.g., UALink, NVLink,
etc.).

In both examples above, the PCIe P2P fabric is used for communication,
whereas in the VF→PF case, it's only using the PCIe P2P address to
extract the VF's VRAM offset, rather than serving as a communication
path. I believe that's Jason's objection. Again, Jason, correct me if
I'm misunderstanding here.

Assuming I'm understanding Jason's comments correctly, I tend to agree
with him.

> And, AFAICS, most of these drivers don't see use the BAR addresses directly
> if they import a dmabuf that they exported earlier and instead do this:
> 
>         if (dma_buf->ops == &xe_dmabuf_ops) {
>                 obj = dma_buf->priv;
>                 if (obj->dev == dev &&
>                     !XE_TEST_ONLY(test && test->force_different_devices)) {
>                         /*
>                          * Importing dmabuf exported from out own gem increases
>                          * refcount on gem itself instead of f_count of dmabuf.
>                          */
>                         drm_gem_object_get(obj);
>                         return obj;
>                 }
>         }

This code won't be triggered on the VF→PF path, as obj->dev == dev will
fail.

> 
> >The PF never actually uses the VF BAR
> That's because the PF can't use it directly, most likely due to hardware limitations.
> 
> >it just hackily converts the dma_addr_t back
> > to CPU physical and figures out where it is in the VRAM pool and then
> > uses a PF centric address for it.
> > 
> > All they want is either the actual VRAM address or the CPU physical.
> The problem here is that the CPU physical (aka BAR Address) is only
> usable by the CPU. Since the GPU PF only understands VRAM addresses,
> the current exporter (vfio-pci) or any VF/VFIO variant driver cannot provide
> the VRAM addresses that the GPU PF can use directly because they do not
> have access to the provisioning data.
>

Right, we need to provide the offset within the VRAM provisioning, which
the PF can resolve to a physical address based on the provisioning data.
The series already does this—the problem is how the VF provides
this offset. It shouldn't be a P2P address, but rather a native
dma-buf-provided offset that everyone involved in the attachment
understands.
 
> However, it is possible that if vfio-pci or a VF/VFIO variant driver had access
> to the VF's provisioning data, then it might be able to create a dmabuf with
> VRAM addresses that the PF can use directly. But I am not sure if exposing
> provisioning data to VFIO drivers is ok from a security standpoint or not.
> 

I'd prefer to leave the provisioning data to the PF if possible. I
haven't fully wrapped my head around the flow yet, but it should be
feasible for the VF → VFIO → PF path to pass along the initial VF
scatter-gather (SG) list in the dma-buf, which includes VF-specific
PFNs. The PF can then use this, along with its provisioning information,
to resolve the physical address.

Matt

> Thanks,
> Vivek
> 
> > 
> > Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23  6:25                     ` Matthew Brost
@ 2025-09-23  6:44                       ` Matthew Brost
  2025-09-23  7:52                         ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Matthew Brost @ 2025-09-23  6:44 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: Jason Gunthorpe, Christian König, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On Mon, Sep 22, 2025 at 11:25:47PM -0700, Matthew Brost wrote:
> On Mon, Sep 22, 2025 at 11:53:06PM -0600, Kasireddy, Vivek wrote:
> > Hi Jason,
> > 
> > > Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> > > functions of Intel GPUs
> > > 
> > > On Mon, Sep 22, 2025 at 01:22:49PM +0200, Christian König wrote:
> > > 
> > > > Well what exactly is happening here? You have a PF assigned to the
> > > > host and a VF passed through to a guest, correct?
> > > >
> > > > And now the PF (from the host side) wants to access a BAR of the VF?
> > > 
> > > Not quite.
> > > 
> > > It is a GPU so it has a pool of VRAM. The PF can access all VRAM and
> > > the VF can access some VRAM.
> > > 
> > > They want to get a DMABUF handle for a bit of the VF's reachable VRAM
> > > that the PF can import and use through it's own funciton.
> > > 
> > > The use of the VF's BAR in this series is an ugly hack.
> > IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
> > to never expose VRAM Addresses and instead have BAR addresses as DMA
> > addresses when exporting dmabufs to other devices. Here is the relevant code
> > snippet in Xe:
> >                 phys_addr_t phys = cursor.start + xe_vram_region_io_start(tile->mem.vram);             
> >                 size_t size = min_t(u64, cursor.size, SZ_2G);                         
> >                 dma_addr_t addr;                                                      
> >                                                                                       
> >                 addr = dma_map_resource(dev, phys, size, dir,                         
> >                                         DMA_ATTR_SKIP_CPU_SYNC);
> > 
> > And, here is the one in amdgpu:
> >         for_each_sgtable_sg((*sgt), sg, i) {
> >                 phys_addr_t phys = cursor.start + adev->gmc.aper_base;
> >                 unsigned long size = min(cursor.size, AMDGPU_MAX_SG_SEGMENT_SIZE);
> >                 dma_addr_t addr;
> > 
> >                 addr = dma_map_resource(dev, phys, size, dir,
> >                                         DMA_ATTR_SKIP_CPU_SYNC);
> > 
> 
> I've read through this thread—Jason, correct me if I'm wrong—but I
> believe what you're suggesting is that instead of using PCIe P2P
> (dma_map_resource) to communicate the VF's VRAM offset to the PF, we
> should teach dma-buf to natively understand a VF's VRAM offset. I don't
> think this is currently built into dma-buf, but it probably should be,
> as it could benefit other use cases as well (e.g., UALink, NVLink,
> etc.).
> 
> In both examples above, the PCIe P2P fabric is used for communication,
> whereas in the VF→PF case, it's only using the PCIe P2P address to
> extract the VF's VRAM offset, rather than serving as a communication
> path. I believe that's Jason's objection. Again, Jason, correct me if
> I'm misunderstanding here.
> 
> Assuming I'm understanding Jason's comments correctly, I tend to agree
> with him.
> 
> > And, AFAICS, most of these drivers don't see use the BAR addresses directly
> > if they import a dmabuf that they exported earlier and instead do this:
> > 
> >         if (dma_buf->ops == &xe_dmabuf_ops) {

Sorry - double reply but the above check would also fail on the VF→PF path.

Matt

> >                 obj = dma_buf->priv;
> >                 if (obj->dev == dev &&
> >                     !XE_TEST_ONLY(test && test->force_different_devices)) {
> >                         /*
> >                          * Importing dmabuf exported from out own gem increases
> >                          * refcount on gem itself instead of f_count of dmabuf.
> >                          */
> >                         drm_gem_object_get(obj);
> >                         return obj;
> >                 }
> >         }
> 
> This code won't be triggered on the VF→PF path, as obj->dev == dev will
> fail.
> 
> > 
> > >The PF never actually uses the VF BAR
> > That's because the PF can't use it directly, most likely due to hardware limitations.
> > 
> > >it just hackily converts the dma_addr_t back
> > > to CPU physical and figures out where it is in the VRAM pool and then
> > > uses a PF centric address for it.
> > > 
> > > All they want is either the actual VRAM address or the CPU physical.
> > The problem here is that the CPU physical (aka BAR Address) is only
> > usable by the CPU. Since the GPU PF only understands VRAM addresses,
> > the current exporter (vfio-pci) or any VF/VFIO variant driver cannot provide
> > the VRAM addresses that the GPU PF can use directly because they do not
> > have access to the provisioning data.
> >
> 
> Right, we need to provide the offset within the VRAM provisioning, which
> the PF can resolve to a physical address based on the provisioning data.
> The series already does this—the problem is how the VF provides
> this offset. It shouldn't be a P2P address, but rather a native
> dma-buf-provided offset that everyone involved in the attachment
> understands.
>  
> > However, it is possible that if vfio-pci or a VF/VFIO variant driver had access
> > to the VF's provisioning data, then it might be able to create a dmabuf with
> > VRAM addresses that the PF can use directly. But I am not sure if exposing
> > provisioning data to VFIO drivers is ok from a security standpoint or not.
> > 
> 
> I'd prefer to leave the provisioning data to the PF if possible. I
> haven't fully wrapped my head around the flow yet, but it should be
> feasible for the VF → VFIO → PF path to pass along the initial VF
> scatter-gather (SG) list in the dma-buf, which includes VF-specific
> PFNs. The PF can then use this, along with its provisioning information,
> to resolve the physical address.
> 
> Matt
> 
> > Thanks,
> > Vivek
> > 
> > > 
> > > Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23  6:44                       ` Matthew Brost
@ 2025-09-23  7:52                         ` Christian König
  2025-09-23 12:15                           ` Jason Gunthorpe
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2025-09-23  7:52 UTC (permalink / raw)
  To: Matthew Brost, Kasireddy, Vivek
  Cc: Jason Gunthorpe, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org, Thomas Hellström

Hi guys,

trying to not let the mail thread branch to much, I'm just replying on the newest mail.

Please let me know if I missed some question.

On 23.09.25 08:44, Matthew Brost wrote:
> On Mon, Sep 22, 2025 at 11:25:47PM -0700, Matthew Brost wrote:
>> On Mon, Sep 22, 2025 at 11:53:06PM -0600, Kasireddy, Vivek wrote:
>>> Hi Jason,
>>>
>>>> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
>>>> functions of Intel GPUs
>>>>
>>>> On Mon, Sep 22, 2025 at 01:22:49PM +0200, Christian König wrote:
>>>>
>>>>> Well what exactly is happening here? You have a PF assigned to the
>>>>> host and a VF passed through to a guest, correct?
>>>>>
>>>>> And now the PF (from the host side) wants to access a BAR of the VF?
>>>>
>>>> Not quite.
>>>>
>>>> It is a GPU so it has a pool of VRAM. The PF can access all VRAM and
>>>> the VF can access some VRAM.
>>>>
>>>> They want to get a DMABUF handle for a bit of the VF's reachable VRAM
>>>> that the PF can import and use through it's own funciton.
>>>>
>>>> The use of the VF's BAR in this series is an ugly hack.
>>> IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
>>> to never expose VRAM Addresses and instead have BAR addresses as DMA
>>> addresses when exporting dmabufs to other devices. Here is the relevant code
>>> snippet in Xe:

That sounds a bit mixed up. There are two different concepts which can be used here:

1. Driver exposing DMA addresses to PCIe BARs.

For example this is done by amdgpu and XE to give other drivers access to MMIO registers as well as VRAM when it isn't backed by struct pages.

2. Drivers short cutting internally access paths.

This is used in amdgpu and a lot of other drivers when it finds that an DMA-buf was exported by itself.

For example the ISP driver part of amdgpu provides the V4L2 interface and when we interchange a DMA-buf with it we recognize that it is actually the same device we work with.

Currently the implementation is based on approach #1, but as far as I can see what's actually needed is approach #2.

>> I've read through this thread—Jason, correct me if I'm wrong—but I
>> believe what you're suggesting is that instead of using PCIe P2P
>> (dma_map_resource) to communicate the VF's VRAM offset to the PF, we
>> should teach dma-buf to natively understand a VF's VRAM offset. I don't
>> think this is currently built into dma-buf, but it probably should be,
>> as it could benefit other use cases as well (e.g., UALink, NVLink,
>> etc.).
>>
>> In both examples above, the PCIe P2P fabric is used for communication,
>> whereas in the VF→PF case, it's only using the PCIe P2P address to
>> extract the VF's VRAM offset, rather than serving as a communication
>> path. I believe that's Jason's objection. Again, Jason, correct me if
>> I'm misunderstanding here.
>>
>> Assuming I'm understanding Jason's comments correctly, I tend to agree
>> with him.

Yeah, agree that here is just an extremely ugly hack.

>>>> The PF never actually uses the VF BAR
>>> That's because the PF can't use it directly, most likely due to hardware limitations.
>>>
>>>> it just hackily converts the dma_addr_t back
>>>> to CPU physical and figures out where it is in the VRAM pool and then
>>>> uses a PF centric address for it.
>>>>
>>>> All they want is either the actual VRAM address or the CPU physical.
>>> The problem here is that the CPU physical (aka BAR Address) is only
>>> usable by the CPU. Since the GPU PF only understands VRAM addresses,
>>> the current exporter (vfio-pci) or any VF/VFIO variant driver cannot provide
>>> the VRAM addresses that the GPU PF can use directly because they do not
>>> have access to the provisioning data.
>>>
>>
>> Right, we need to provide the offset within the VRAM provisioning, which
>> the PF can resolve to a physical address based on the provisioning data.
>> The series already does this—the problem is how the VF provides
>> this offset. It shouldn't be a P2P address, but rather a native
>> dma-buf-provided offset that everyone involved in the attachment
>> understands.

What you can do is to either export the DMA-buf from the driver who feels responsible for the PF directly (that's what we do in amdgpu because the VRAM is actually not fully accessible through the BAR).

Or you could extend the VFIO driver with a private interface for the PF to exposing the offsets into the BAR instead of the DMA addresses.

>>  
>>> However, it is possible that if vfio-pci or a VF/VFIO variant driver had access
>>> to the VF's provisioning data, then it might be able to create a dmabuf with
>>> VRAM addresses that the PF can use directly. But I am not sure if exposing
>>> provisioning data to VFIO drivers is ok from a security standpoint or not.
>>>

How are those offsets into the BAR communicated from the guest to the host in the first place?

>> I'd prefer to leave the provisioning data to the PF if possible. I
>> haven't fully wrapped my head around the flow yet, but it should be
>> feasible for the VF → VFIO → PF path to pass along the initial VF
>> scatter-gather (SG) list in the dma-buf, which includes VF-specific
>> PFNs. The PF can then use this, along with its provisioning information,
>> to resolve the physical address.

Well don't put that into the sg_table but rather into an xarray or similar, but in general that's the correct idea.

Regards,
Christian.

>>
>> Matt
>>
>>> Thanks,
>>> Vivek
>>>
>>>>
>>>> Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23  7:52                         ` Christian König
@ 2025-09-23 12:15                           ` Jason Gunthorpe
  2025-09-23 12:45                             ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-23 12:15 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, Kasireddy, Vivek, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On Tue, Sep 23, 2025 at 09:52:04AM +0200, Christian König wrote:
> For example the ISP driver part of amdgpu provides the V4L2
> interface and when we interchange a DMA-buf with it we recognize that
> it is actually the same device we work with.

One of the issues here is the mis-use of dma_map_resource() to create
dma_addr_t for PCI devices. This was never correct.

VFIO is using a new correct ACS aware DMA mapping API that I would
expect all the DMABUF world to slowly migrate to. This API prevents
mappings in cases that don't work in HW.

So a design where you have to DMA map something then throw away the
DMA map after doing some "shortcut" check isn't going to work.

We need some way for the importer/exporter to negotiate what kind of
address they want to exchange without forcing a dma mapping.

> >> I've read through this thread—Jason, correct me if I'm wrong—but I
> >> believe what you're suggesting is that instead of using PCIe P2P
> >> (dma_map_resource) to communicate the VF's VRAM offset to the PF, we
> >> should teach dma-buf to natively understand a VF's VRAM offset. I don't
> >> think this is currently built into dma-buf, but it probably should be,
> >> as it could benefit other use cases as well (e.g., UALink, NVLink,
> >> etc.).
> >>
> >> In both examples above, the PCIe P2P fabric is used for communication,
> >> whereas in the VF→PF case, it's only using the PCIe P2P address to
> >> extract the VF's VRAM offset, rather than serving as a communication
> >> path. I believe that's Jason's objection. Again, Jason, correct me if
> >> I'm misunderstanding here.

Yes, this is my point.

We have many cases now where a dma_addr_t is not the appropriate way
to exchange addressing information from importer/exporter and we need
more flexibility.

I also consider the KVM and iommufd use cases that must have a
phys_addr_t in this statement.

> What you can do is to either export the DMA-buf from the driver who
> feels responsible for the PF directly (that's what we do in amdgpu
> because the VRAM is actually not fully accessible through the BAR).

Again, considering security somehow as there should not be uAPI to
just give uncontrolled access to VRAM.

From a security side having the VF create the DMABUF is better as you
get that security proof that it is permitted to access the VRAM.

From this thread I think if VFIO had the negotiated option to export a
CPU phys_addr_t then the Xe PF driver can reliably convert that to a
VRAM offset.

We need to add a CPU phys_addr_t option for VFIO to iommufd and KVM
anyhow, those cases can't use dma_addr_t.

> >> I'd prefer to leave the provisioning data to the PF if possible. I
> >> haven't fully wrapped my head around the flow yet, but it should be
> >> feasible for the VF → VFIO → PF path to pass along the initial VF
> >> scatter-gather (SG) list in the dma-buf, which includes VF-specific
> >> PFNs. The PF can then use this, along with its provisioning information,
> >> to resolve the physical address.
> 
> Well don't put that into the sg_table but rather into an xarray or
> similar, but in general that's the correct idea.

Yes, please lets move away from re-using dma_addr_t to represent
things that are not created by the DMA API.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23 12:15                           ` Jason Gunthorpe
@ 2025-09-23 12:45                             ` Christian König
  2025-09-23 13:12                               ` Jason Gunthorpe
  2025-09-23 13:36                               ` Christoph Hellwig
  0 siblings, 2 replies; 46+ messages in thread
From: Christian König @ 2025-09-23 12:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Brost, Kasireddy, Vivek, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On 23.09.25 14:15, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 09:52:04AM +0200, Christian König wrote:
>> For example the ISP driver part of amdgpu provides the V4L2
>> interface and when we interchange a DMA-buf with it we recognize that
>> it is actually the same device we work with.
> 
> One of the issues here is the mis-use of dma_map_resource() to create
> dma_addr_t for PCI devices. This was never correct.

That is not a mis-use at all but rather exactly what dma_map_resource() was created for.

If dma_map_resource() is not ACS aware than we should add that.

> VFIO is using a new correct ACS aware DMA mapping API that I would
> expect all the DMABUF world to slowly migrate to. This API prevents
> mappings in cases that don't work in HW.
> 
> So a design where you have to DMA map something then throw away the
> DMA map after doing some "shortcut" check isn't going to work.
> 
> We need some way for the importer/exporter to negotiate what kind of
> address they want to exchange without forcing a dma mapping.

That is already in place. We don't DMA map anything in those use cases.

>>>> I've read through this thread—Jason, correct me if I'm wrong—but I
>>>> believe what you're suggesting is that instead of using PCIe P2P
>>>> (dma_map_resource) to communicate the VF's VRAM offset to the PF, we
>>>> should teach dma-buf to natively understand a VF's VRAM offset. I don't
>>>> think this is currently built into dma-buf, but it probably should be,
>>>> as it could benefit other use cases as well (e.g., UALink, NVLink,
>>>> etc.).
>>>>
>>>> In both examples above, the PCIe P2P fabric is used for communication,
>>>> whereas in the VF→PF case, it's only using the PCIe P2P address to
>>>> extract the VF's VRAM offset, rather than serving as a communication
>>>> path. I believe that's Jason's objection. Again, Jason, correct me if
>>>> I'm misunderstanding here.
> 
> Yes, this is my point.
> 
> We have many cases now where a dma_addr_t is not the appropriate way
> to exchange addressing information from importer/exporter and we need
> more flexibility.
> 
> I also consider the KVM and iommufd use cases that must have a
> phys_addr_t in this statement.

Abusing phys_addr_t is also the completely wrong approach in that moment.

When you want to communicate addresses in a device specific address space you need a device specific type for that and not abuse phys_addr_t.

>> What you can do is to either export the DMA-buf from the driver who
>> feels responsible for the PF directly (that's what we do in amdgpu
>> because the VRAM is actually not fully accessible through the BAR).
> 
> Again, considering security somehow as there should not be uAPI to
> just give uncontrolled access to VRAM.
> 
> From a security side having the VF create the DMABUF is better as you
> get that security proof that it is permitted to access the VRAM.

Well the VF is basically just a window into the HW of the PF.

The real question is where does the VFIO gets the necessary information which parts of the BAR to expose?

> From this thread I think if VFIO had the negotiated option to export a
> CPU phys_addr_t then the Xe PF driver can reliably convert that to a
> VRAM offset.
> 
> We need to add a CPU phys_addr_t option for VFIO to iommufd and KVM
> anyhow, those cases can't use dma_addr_t.

Clear NAK to using CPU phys_addr_t. This is just a horrible idea.

Regards,
Christian.

> 
>>>> I'd prefer to leave the provisioning data to the PF if possible. I
>>>> haven't fully wrapped my head around the flow yet, but it should be
>>>> feasible for the VF → VFIO → PF path to pass along the initial VF
>>>> scatter-gather (SG) list in the dma-buf, which includes VF-specific
>>>> PFNs. The PF can then use this, along with its provisioning information,
>>>> to resolve the physical address.
>>
>> Well don't put that into the sg_table but rather into an xarray or
>> similar, but in general that's the correct idea.
> 
> Yes, please lets move away from re-using dma_addr_t to represent
> things that are not created by the DMA API.
> 
> Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23 12:45                             ` Christian König
@ 2025-09-23 13:12                               ` Jason Gunthorpe
  2025-09-23 13:28                                 ` Christian König
  2025-09-23 13:36                               ` Christoph Hellwig
  1 sibling, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-23 13:12 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, Kasireddy, Vivek, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On Tue, Sep 23, 2025 at 02:45:10PM +0200, Christian König wrote:
> On 23.09.25 14:15, Jason Gunthorpe wrote:
> > On Tue, Sep 23, 2025 at 09:52:04AM +0200, Christian König wrote:
> >> For example the ISP driver part of amdgpu provides the V4L2
> >> interface and when we interchange a DMA-buf with it we recognize that
> >> it is actually the same device we work with.
> > 
> > One of the issues here is the mis-use of dma_map_resource() to create
> > dma_addr_t for PCI devices. This was never correct.
> 
> That is not a mis-use at all but rather exactly what
> dma_map_resource() was created for.

No, it isn't this is a misunderstanding. It was created for SOC
resources only. I think HCH made this clear a number of times.

> If dma_map_resource() is not ACS aware than we should add that.

It can't be fixed with the API it has. See how the new VFIO patches
are working to understand the proposal.
 
> > We have many cases now where a dma_addr_t is not the appropriate way
> > to exchange addressing information from importer/exporter and we need
> > more flexibility.
> > 
> > I also consider the KVM and iommufd use cases that must have a
> > phys_addr_t in this statement.
> 
> Abusing phys_addr_t is also the completely wrong approach in that moment.
> 
> When you want to communicate addresses in a device specific address
> space you need a device specific type for that and not abuse
> phys_addr_t.

I'm not talking about abusing phys_addr_t, I'm talking about putting a
legitimate CPU address in there.

You can argue it is hack in Xe to reverse engineer the VRAM offset
from a CPU physical, and I would be sympathetic, but it does allow
VFIO to be general not specialized to Xe.

> The real question is where does the VFIO gets the necessary
> information which parts of the BAR to expose?

It needs a varaint driver that understands to reach into the PF parent
and extract this information.

There is a healthy amount of annoyance to building something like this.
 
> > From this thread I think if VFIO had the negotiated option to export a
> > CPU phys_addr_t then the Xe PF driver can reliably convert that to a
> > VRAM offset.
> > 
> > We need to add a CPU phys_addr_t option for VFIO to iommufd and KVM
> > anyhow, those cases can't use dma_addr_t.
> 
> Clear NAK to using CPU phys_addr_t. This is just a horrible idea.

We already talked about this, Simona agreed, we need to get
phys_addr_t optionally out of VFIO's dmabuf for a few importers. We
cannot use dma_addr_t.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23 13:12                               ` Jason Gunthorpe
@ 2025-09-23 13:28                                 ` Christian König
  2025-09-23 13:38                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2025-09-23 13:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Brost, Kasireddy, Vivek, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On 23.09.25 15:12, Jason Gunthorpe wrote:
>> When you want to communicate addresses in a device specific address
>> space you need a device specific type for that and not abuse
>> phys_addr_t.
> 
> I'm not talking about abusing phys_addr_t, I'm talking about putting a
> legitimate CPU address in there.
> 
> You can argue it is hack in Xe to reverse engineer the VRAM offset
> from a CPU physical, and I would be sympathetic, but it does allow
> VFIO to be general not specialized to Xe.

No, exactly that doesn't work for all use cases. That's why I'm pushing back so hard on using phys_addr_t or CPU addresses.

See the CPU address is only valid temporary because the VF BAR is only a window into the device memory.

This window is open as long as the CPU is using it, but as soon as that is not the case any more that window might close creating tons of lifetime issues.

>> The real question is where does the VFIO gets the necessary
>> information which parts of the BAR to expose?
> 
> It needs a varaint driver that understands to reach into the PF parent
> and extract this information.
> 
> There is a healthy amount of annoyance to building something like this.
>  
>>> From this thread I think if VFIO had the negotiated option to export a
>>> CPU phys_addr_t then the Xe PF driver can reliably convert that to a
>>> VRAM offset.
>>>
>>> We need to add a CPU phys_addr_t option for VFIO to iommufd and KVM
>>> anyhow, those cases can't use dma_addr_t.
>>
>> Clear NAK to using CPU phys_addr_t. This is just a horrible idea.
> 
> We already talked about this, Simona agreed, we need to get
> phys_addr_t optionally out of VFIO's dmabuf for a few importers. We
> cannot use dma_addr_t.

Not saying that we should use dma_addr_t, but using phys_addr_t is as equally broken and I will certainly NAK any approach using this as general interface between drivers.

What Simona agreed on is exactly what I proposed as well, that you get a private interface for exactly that use case.

Regards,
Christian.

> 
> Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23 13:28                                 ` Christian König
@ 2025-09-23 13:38                                   ` Jason Gunthorpe
  2025-09-23 13:48                                     ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-23 13:38 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, Kasireddy, Vivek, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On Tue, Sep 23, 2025 at 03:28:53PM +0200, Christian König wrote:
> On 23.09.25 15:12, Jason Gunthorpe wrote:
> >> When you want to communicate addresses in a device specific address
> >> space you need a device specific type for that and not abuse
> >> phys_addr_t.
> > 
> > I'm not talking about abusing phys_addr_t, I'm talking about putting a
> > legitimate CPU address in there.
> > 
> > You can argue it is hack in Xe to reverse engineer the VRAM offset
> > from a CPU physical, and I would be sympathetic, but it does allow
> > VFIO to be general not specialized to Xe.
> 
> No, exactly that doesn't work for all use cases. That's why I'm
> pushing back so hard on using phys_addr_t or CPU addresses.
> 
> See the CPU address is only valid temporary because the VF BAR is
> only a window into the device memory.

I know, generally yes.

But there should be no way that a VFIO VF driver in the hypervisor
knows what is currently mapped to the VF's BAR. The only way I can
make sense of what Xe is doing here is if the VF BAR is a static
aperture of the VRAM..

Would be nice to know the details.
 
> What Simona agreed on is exactly what I proposed as well, that you
> get a private interface for exactly that use case.

A "private" interface to exchange phys_addr_t between at least
VFIO/KVM/iommufd - sure no complaint with that.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23 13:38                                   ` Jason Gunthorpe
@ 2025-09-23 13:48                                     ` Christian König
  2025-09-23 23:02                                       ` Matthew Brost
  2025-09-24  6:50                                       ` Kasireddy, Vivek
  0 siblings, 2 replies; 46+ messages in thread
From: Christian König @ 2025-09-23 13:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Brost, Kasireddy, Vivek, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On 23.09.25 15:38, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 03:28:53PM +0200, Christian König wrote:
>> On 23.09.25 15:12, Jason Gunthorpe wrote:
>>>> When you want to communicate addresses in a device specific address
>>>> space you need a device specific type for that and not abuse
>>>> phys_addr_t.
>>>
>>> I'm not talking about abusing phys_addr_t, I'm talking about putting a
>>> legitimate CPU address in there.
>>>
>>> You can argue it is hack in Xe to reverse engineer the VRAM offset
>>> from a CPU physical, and I would be sympathetic, but it does allow
>>> VFIO to be general not specialized to Xe.
>>
>> No, exactly that doesn't work for all use cases. That's why I'm
>> pushing back so hard on using phys_addr_t or CPU addresses.
>>
>> See the CPU address is only valid temporary because the VF BAR is
>> only a window into the device memory.
> 
> I know, generally yes.
> 
> But there should be no way that a VFIO VF driver in the hypervisor
> knows what is currently mapped to the VF's BAR. The only way I can
> make sense of what Xe is doing here is if the VF BAR is a static
> aperture of the VRAM..
> 
> Would be nice to know the details.

Yeah, that's why i asked how VFIO gets the information which parts of the it's BAR should be part of the DMA-buf?

That would be really interesting to know.

Regards,
Christian.

>  
>> What Simona agreed on is exactly what I proposed as well, that you
>> get a private interface for exactly that use case.
> 
> A "private" interface to exchange phys_addr_t between at least
> VFIO/KVM/iommufd - sure no complaint with that.
> 
> Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23 13:48                                     ` Christian König
@ 2025-09-23 23:02                                       ` Matthew Brost
  2025-09-24  8:29                                         ` Christian König
  2025-09-24  6:50                                       ` Kasireddy, Vivek
  1 sibling, 1 reply; 46+ messages in thread
From: Matthew Brost @ 2025-09-23 23:02 UTC (permalink / raw)
  To: Christian König
  Cc: Jason Gunthorpe, Kasireddy, Vivek, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On Tue, Sep 23, 2025 at 03:48:59PM +0200, Christian König wrote:
> On 23.09.25 15:38, Jason Gunthorpe wrote:
> > On Tue, Sep 23, 2025 at 03:28:53PM +0200, Christian König wrote:
> >> On 23.09.25 15:12, Jason Gunthorpe wrote:
> >>>> When you want to communicate addresses in a device specific address
> >>>> space you need a device specific type for that and not abuse
> >>>> phys_addr_t.
> >>>
> >>> I'm not talking about abusing phys_addr_t, I'm talking about putting a
> >>> legitimate CPU address in there.
> >>>
> >>> You can argue it is hack in Xe to reverse engineer the VRAM offset
> >>> from a CPU physical, and I would be sympathetic, but it does allow
> >>> VFIO to be general not specialized to Xe.
> >>
> >> No, exactly that doesn't work for all use cases. That's why I'm
> >> pushing back so hard on using phys_addr_t or CPU addresses.
> >>
> >> See the CPU address is only valid temporary because the VF BAR is
> >> only a window into the device memory.
> > 
> > I know, generally yes.
> > 
> > But there should be no way that a VFIO VF driver in the hypervisor
> > knows what is currently mapped to the VF's BAR. The only way I can
> > make sense of what Xe is doing here is if the VF BAR is a static
> > aperture of the VRAM..
> > 
> > Would be nice to know the details.
> 
> Yeah, that's why i asked how VFIO gets the information which parts of the it's BAR should be part of the DMA-buf?
> 

Vivek can confirm for sure, but I believe the VF knows the size of its
VRAM space and simply wants to pass along the offset and allocation
order within that space. The PF knows where the VF's VRAM is located in
the BAR, and the combination of the VF base offset and the individual
allocation offset is what gets programmed into the PF page tables.

> That would be really interesting to know.
> 
> Regards,
> Christian.
> 
> >  
> >> What Simona agreed on is exactly what I proposed as well, that you
> >> get a private interface for exactly that use case.

Do you have a link to the conversation with Simona? I'd lean towards a
kernel-wide generic interface if possible.

Regarding phys_addr_t vs. dma_addr_t, I don't have a strong opinion. But
what about using an array of unsigned long with the order encoded
similarly to HMM PFNs? Drivers can interpret the address portion of the
data based on their individual use cases.

Also, to make this complete—do you think we'd need to teach TTM to
understand this new type of dma-buf, like we do for SG list dma-bufs? It
would seem a bit pointless if we just had to convert this new dma-buf
back into an SG list to pass it along to TTM.

The scope of this seems considerably larger than the original series. It
would be good for all stakeholders to reach an agreement so Vivek can
move forward.

Matt

> > 
> > A "private" interface to exchange phys_addr_t between at least
> > VFIO/KVM/iommufd - sure no complaint with that.
> > 
> > Jason
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23 23:02                                       ` Matthew Brost
@ 2025-09-24  8:29                                         ` Christian König
  0 siblings, 0 replies; 46+ messages in thread
From: Christian König @ 2025-09-24  8:29 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Jason Gunthorpe, Kasireddy, Vivek, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On 24.09.25 01:02, Matthew Brost wrote:
>>>> What Simona agreed on is exactly what I proposed as well, that you
>>>> get a private interface for exactly that use case.
> 
> Do you have a link to the conversation with Simona? I'd lean towards a
> kernel-wide generic interface if possible.

Oh, finding that exactly mail is tricky.

IIRC she wrote something along the lines of "this should be done in a vfio/iommufd interface", but maybe my memory is a bit selective.

We can of course still leverage the DMA-buf lifetime, synchronization and other functionalities.

But this is so vfio specific that this is not going to fly as general DMA-buf interface I think.

> Regarding phys_addr_t vs. dma_addr_t, I don't have a strong opinion. But
> what about using an array of unsigned long with the order encoded
> similarly to HMM PFNs? Drivers can interpret the address portion of the
> data based on their individual use cases.

That's basically what I had in mind for replacing the sg_table.

> Also, to make this complete—do you think we'd need to teach TTM to
> understand this new type of dma-buf, like we do for SG list dma-bufs? It
> would seem a bit pointless if we just had to convert this new dma-buf
> back into an SG list to pass it along to TTM.

Using an sg_table / SG list in DMA-buf and TTM was a bad idea to begin with. At least for amdgpu we have switched over to just have that around temporary for most use cases.

What we need is a container for efficient dma_addr_t storage (e.g. using low bits for the size/order of the area). Then iterators/cursors to go over that container.

Switching between an sg_table and that new container is then just switching out the iterators.

> The scope of this seems considerably larger than the original series. It
> would be good for all stakeholders to reach an agreement so Vivek can
> move forward.

Yeah, agree.

Regards,
Christian.

> 
> Matt
> 
>>>
>>> A "private" interface to exchange phys_addr_t between at least
>>> VFIO/KVM/iommufd - sure no complaint with that.
>>>
>>> Jason
>>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23 13:48                                     ` Christian König
  2025-09-23 23:02                                       ` Matthew Brost
@ 2025-09-24  6:50                                       ` Kasireddy, Vivek
  2025-09-24  7:21                                         ` Christian König
  1 sibling, 1 reply; 46+ messages in thread
From: Kasireddy, Vivek @ 2025-09-24  6:50 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe
  Cc: Brost, Matthew, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org, Thomas Hellström

Hi Christian,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
> 
> On 23.09.25 15:38, Jason Gunthorpe wrote:
> > On Tue, Sep 23, 2025 at 03:28:53PM +0200, Christian König wrote:
> >> On 23.09.25 15:12, Jason Gunthorpe wrote:
> >>>> When you want to communicate addresses in a device specific address
> >>>> space you need a device specific type for that and not abuse
> >>>> phys_addr_t.
> >>>
> >>> I'm not talking about abusing phys_addr_t, I'm talking about putting a
> >>> legitimate CPU address in there.
> >>>
> >>> You can argue it is hack in Xe to reverse engineer the VRAM offset
> >>> from a CPU physical, and I would be sympathetic, but it does allow
> >>> VFIO to be general not specialized to Xe.
> >>
> >> No, exactly that doesn't work for all use cases. That's why I'm
> >> pushing back so hard on using phys_addr_t or CPU addresses.
> >>
> >> See the CPU address is only valid temporary because the VF BAR is
> >> only a window into the device memory.
> >
> > I know, generally yes.
> >
> > But there should be no way that a VFIO VF driver in the hypervisor
> > knows what is currently mapped to the VF's BAR. The only way I can
> > make sense of what Xe is doing here is if the VF BAR is a static
> > aperture of the VRAM..
> >
> > Would be nice to know the details.
> 
> Yeah, that's why i asked how VFIO gets the information which parts of the
> it's BAR should be part of the DMA-buf?
> 
> That would be really interesting to know.
As Jason guessed, we are relying on the GPU VF being a Large BAR
device here. In other words, as you suggested, this will not work if the
VF BAR size is not as big as its actual VRAM portion. We can certainly add
this check but we have not seen either the GPU PF or VF getting detected
as Small BAR devices in various test environments.

So, given the above, once a VF device is bound to vfio-pci driver and
assigned to a Guest VM (launched via Qemu), Qemu's vfio layer maps
all the VF's resources including the BARs. This mapping info (specifically HVA)
is leveraged (by Qemu) to identity the offset at which the Guest VM's buffer
is located (in the BAR) and this info is then provided to vfio-pci kernel driver
which finally creates the dmabuf (with BAR Addresses).

Thanks,
Vivek

> 
> Regards,
> Christian.
> 
> >
> >> What Simona agreed on is exactly what I proposed as well, that you
> >> get a private interface for exactly that use case.
> >
> > A "private" interface to exchange phys_addr_t between at least
> > VFIO/KVM/iommufd - sure no complaint with that.
> >
> > Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-24  6:50                                       ` Kasireddy, Vivek
@ 2025-09-24  7:21                                         ` Christian König
  2025-09-25  3:56                                           ` Kasireddy, Vivek
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2025-09-24  7:21 UTC (permalink / raw)
  To: Kasireddy, Vivek, Jason Gunthorpe
  Cc: Brost, Matthew, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org, Thomas Hellström

On 24.09.25 08:50, Kasireddy, Vivek wrote:
> Hi Christian,
> 
>> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
>> functions of Intel GPUs
>>
>> On 23.09.25 15:38, Jason Gunthorpe wrote:
>>> On Tue, Sep 23, 2025 at 03:28:53PM +0200, Christian König wrote:
>>>> On 23.09.25 15:12, Jason Gunthorpe wrote:
>>>>>> When you want to communicate addresses in a device specific address
>>>>>> space you need a device specific type for that and not abuse
>>>>>> phys_addr_t.
>>>>>
>>>>> I'm not talking about abusing phys_addr_t, I'm talking about putting a
>>>>> legitimate CPU address in there.
>>>>>
>>>>> You can argue it is hack in Xe to reverse engineer the VRAM offset
>>>>> from a CPU physical, and I would be sympathetic, but it does allow
>>>>> VFIO to be general not specialized to Xe.
>>>>
>>>> No, exactly that doesn't work for all use cases. That's why I'm
>>>> pushing back so hard on using phys_addr_t or CPU addresses.
>>>>
>>>> See the CPU address is only valid temporary because the VF BAR is
>>>> only a window into the device memory.
>>>
>>> I know, generally yes.
>>>
>>> But there should be no way that a VFIO VF driver in the hypervisor
>>> knows what is currently mapped to the VF's BAR. The only way I can
>>> make sense of what Xe is doing here is if the VF BAR is a static
>>> aperture of the VRAM..
>>>
>>> Would be nice to know the details.
>>
>> Yeah, that's why i asked how VFIO gets the information which parts of the
>> it's BAR should be part of the DMA-buf?
>>
>> That would be really interesting to know.
> As Jason guessed, we are relying on the GPU VF being a Large BAR
> device here. In other words, as you suggested, this will not work if the
> VF BAR size is not as big as its actual VRAM portion. We can certainly add
> this check but we have not seen either the GPU PF or VF getting detected
> as Small BAR devices in various test environments.
> 
> So, given the above, once a VF device is bound to vfio-pci driver and
> assigned to a Guest VM (launched via Qemu), Qemu's vfio layer maps
> all the VF's resources including the BARs. This mapping info (specifically HVA)
> is leveraged (by Qemu) to identity the offset at which the Guest VM's buffer
> is located (in the BAR) and this info is then provided to vfio-pci kernel driver
> which finally creates the dmabuf (with BAR Addresses).

In that case I strongly suggest to add a private DMA-buf interface for the DMA-bufs exported by vfio-pci which returns which BAR and offset the DMA-buf represents.

Ideally using the same structure Qemu used to provide the offset to the vfio-pci driver, but not a must have.

This way the driver for the GPU PF (XE) can leverage this interface, validates that the DMA-buf comes from a VF it feels responsible for and do the math to figure out in which parts of the VRAM needs to be accessed to scanout the picture.

This way this private vfio-pci interface can also be used by iommufd for example.

Regards,
Christian.

> 
> Thanks,
> Vivek
> 
>>
>> Regards,
>> Christian.
>>
>>>
>>>> What Simona agreed on is exactly what I proposed as well, that you
>>>> get a private interface for exactly that use case.
>>>
>>> A "private" interface to exchange phys_addr_t between at least
>>> VFIO/KVM/iommufd - sure no complaint with that.
>>>
>>> Jason
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-24  7:21                                         ` Christian König
@ 2025-09-25  3:56                                           ` Kasireddy, Vivek
  2025-09-25 10:51                                             ` Thomas Hellström
  0 siblings, 1 reply; 46+ messages in thread
From: Kasireddy, Vivek @ 2025-09-25  3:56 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe
  Cc: Brost, Matthew, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org, Thomas Hellström

Hi Christian,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
> 
> >>
> >> On 23.09.25 15:38, Jason Gunthorpe wrote:
> >>> On Tue, Sep 23, 2025 at 03:28:53PM +0200, Christian König wrote:
> >>>> On 23.09.25 15:12, Jason Gunthorpe wrote:
> >>>>>> When you want to communicate addresses in a device specific
> address
> >>>>>> space you need a device specific type for that and not abuse
> >>>>>> phys_addr_t.
> >>>>>
> >>>>> I'm not talking about abusing phys_addr_t, I'm talking about putting a
> >>>>> legitimate CPU address in there.
> >>>>>
> >>>>> You can argue it is hack in Xe to reverse engineer the VRAM offset
> >>>>> from a CPU physical, and I would be sympathetic, but it does allow
> >>>>> VFIO to be general not specialized to Xe.
> >>>>
> >>>> No, exactly that doesn't work for all use cases. That's why I'm
> >>>> pushing back so hard on using phys_addr_t or CPU addresses.
> >>>>
> >>>> See the CPU address is only valid temporary because the VF BAR is
> >>>> only a window into the device memory.
> >>>
> >>> I know, generally yes.
> >>>
> >>> But there should be no way that a VFIO VF driver in the hypervisor
> >>> knows what is currently mapped to the VF's BAR. The only way I can
> >>> make sense of what Xe is doing here is if the VF BAR is a static
> >>> aperture of the VRAM..
> >>>
> >>> Would be nice to know the details.
> >>
> >> Yeah, that's why i asked how VFIO gets the information which parts of the
> >> it's BAR should be part of the DMA-buf?
> >>
> >> That would be really interesting to know.
> > As Jason guessed, we are relying on the GPU VF being a Large BAR
> > device here. In other words, as you suggested, this will not work if the
> > VF BAR size is not as big as its actual VRAM portion. We can certainly add
> > this check but we have not seen either the GPU PF or VF getting detected
> > as Small BAR devices in various test environments.
> >
> > So, given the above, once a VF device is bound to vfio-pci driver and
> > assigned to a Guest VM (launched via Qemu), Qemu's vfio layer maps
> > all the VF's resources including the BARs. This mapping info (specifically
> HVA)
> > is leveraged (by Qemu) to identity the offset at which the Guest VM's buffer
> > is located (in the BAR) and this info is then provided to vfio-pci kernel driver
> > which finally creates the dmabuf (with BAR Addresses).
> 
> In that case I strongly suggest to add a private DMA-buf interface for the DMA-
> bufs exported by vfio-pci which returns which BAR and offset the DMA-buf
> represents.
Does this private dmabuf interface already exist or does it need to be created
from the ground up?

If it already exists, could you please share an example/reference of how you
have used it with amdgpu or other drivers?

If it doesn't exist, I was wondering if it should be based on any particular best
practices/ideas (or design patterns) that already exist in other drivers?

> 
> Ideally using the same structure Qemu used to provide the offset to the vfio-
> pci driver, but not a must have.
> 
> This way the driver for the GPU PF (XE) can leverage this interface, validates
> that the DMA-buf comes from a VF it feels responsible for and do the math to
> figure out in which parts of the VRAM needs to be accessed to scanout the
> picture.
Sounds good. This is definitely a viable path forward and it looks like we are all
in agreement with this idea.

I guess we can start exploring how to implement the private dmabuf interface
mechanism right away.

Thanks,
Vivek

> 
> This way this private vfio-pci interface can also be used by iommufd for
> example.
> 
> Regards,
> Christian.
> 
> >
> > Thanks,
> > Vivek
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>>> What Simona agreed on is exactly what I proposed as well, that you
> >>>> get a private interface for exactly that use case.
> >>>
> >>> A "private" interface to exchange phys_addr_t between at least
> >>> VFIO/KVM/iommufd - sure no complaint with that.
> >>>
> >>> Jason
> >


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-25  3:56                                           ` Kasireddy, Vivek
@ 2025-09-25 10:51                                             ` Thomas Hellström
  2025-09-25 11:28                                               ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Hellström @ 2025-09-25 10:51 UTC (permalink / raw)
  To: Kasireddy, Vivek, Christian König, Jason Gunthorpe
  Cc: Brost, Matthew, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On Thu, 2025-09-25 at 03:56 +0000, Kasireddy, Vivek wrote:
> Hi Christian,
> 
> > Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for
> > device
> > functions of Intel GPUs
> > 
> > > > 
> > > > On 23.09.25 15:38, Jason Gunthorpe wrote:
> > > > > On Tue, Sep 23, 2025 at 03:28:53PM +0200, Christian König
> > > > > wrote:
> > > > > > On 23.09.25 15:12, Jason Gunthorpe wrote:
> > > > > > > > When you want to communicate addresses in a device
> > > > > > > > specific
> > address
> > > > > > > > space you need a device specific type for that and not
> > > > > > > > abuse
> > > > > > > > phys_addr_t.
> > > > > > > 
> > > > > > > I'm not talking about abusing phys_addr_t, I'm talking
> > > > > > > about putting a
> > > > > > > legitimate CPU address in there.
> > > > > > > 
> > > > > > > You can argue it is hack in Xe to reverse engineer the
> > > > > > > VRAM offset
> > > > > > > from a CPU physical, and I would be sympathetic, but it
> > > > > > > does allow
> > > > > > > VFIO to be general not specialized to Xe.
> > > > > > 
> > > > > > No, exactly that doesn't work for all use cases. That's why
> > > > > > I'm
> > > > > > pushing back so hard on using phys_addr_t or CPU addresses.
> > > > > > 
> > > > > > See the CPU address is only valid temporary because the VF
> > > > > > BAR is
> > > > > > only a window into the device memory.
> > > > > 
> > > > > I know, generally yes.
> > > > > 
> > > > > But there should be no way that a VFIO VF driver in the
> > > > > hypervisor
> > > > > knows what is currently mapped to the VF's BAR. The only way
> > > > > I can
> > > > > make sense of what Xe is doing here is if the VF BAR is a
> > > > > static
> > > > > aperture of the VRAM..
> > > > > 
> > > > > Would be nice to know the details.
> > > > 
> > > > Yeah, that's why i asked how VFIO gets the information which
> > > > parts of the
> > > > it's BAR should be part of the DMA-buf?
> > > > 
> > > > That would be really interesting to know.
> > > As Jason guessed, we are relying on the GPU VF being a Large BAR
> > > device here. In other words, as you suggested, this will not work
> > > if the
> > > VF BAR size is not as big as its actual VRAM portion. We can
> > > certainly add
> > > this check but we have not seen either the GPU PF or VF getting
> > > detected
> > > as Small BAR devices in various test environments.
> > > 
> > > So, given the above, once a VF device is bound to vfio-pci driver
> > > and
> > > assigned to a Guest VM (launched via Qemu), Qemu's vfio layer
> > > maps
> > > all the VF's resources including the BARs. This mapping info
> > > (specifically
> > HVA)
> > > is leveraged (by Qemu) to identity the offset at which the Guest
> > > VM's buffer
> > > is located (in the BAR) and this info is then provided to vfio-
> > > pci kernel driver
> > > which finally creates the dmabuf (with BAR Addresses).
> > 
> > In that case I strongly suggest to add a private DMA-buf interface
> > for the DMA-
> > bufs exported by vfio-pci which returns which BAR and offset the
> > DMA-buf
> > represents.

@Christian, Is what you're referring to here the "dma_buf private
interconnect" we've been discussing previously, now only between vfio-
pci and any interested importers instead of private to a known exporter
and importer?

If so I have a POC I can post as an RFC on a way to negotiate such an
interconnect.

> Does this private dmabuf interface already exist or does it need to
> be created
> from the ground up?
> 
> If it already exists, could you please share an example/reference of
> how you
> have used it with amdgpu or other drivers?
> 
> If it doesn't exist, I was wondering if it should be based on any
> particular best
> practices/ideas (or design patterns) that already exist in other
> drivers?

@Vivek, another question: Also on the guest side we're exporting dma-
mapped adresses that are imported and somehow decoded by the guest
virtio-gpu driver? Is something similar needed there?

Also how would the guest side VF driver know that what is assumed to be
a PF on the same device is actually a PF on the same device and not a
completely different device with another driver? (In which case I
assume it would like to export a system dma-buf)?

Thanks,
Thomas



> 
> > 
> > Ideally using the same structure Qemu used to provide the offset to
> > the vfio-
> > pci driver, but not a must have.
> > 
> > This way the driver for the GPU PF (XE) can leverage this
> > interface, validates
> > that the DMA-buf comes from a VF it feels responsible for and do
> > the math to
> > figure out in which parts of the VRAM needs to be accessed to
> > scanout the
> > picture.
> Sounds good. This is definitely a viable path forward and it looks
> like we are all
> in agreement with this idea.
> 
> I guess we can start exploring how to implement the private dmabuf
> interface
> mechanism right away.
> 
> Thanks,
> Vivek
> 
> > 
> > This way this private vfio-pci interface can also be used by
> > iommufd for
> > example.
> > 
> > Regards,
> > Christian.
> > 
> > > 
> > > Thanks,
> > > Vivek
> > > 
> > > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > 
> > > > > > What Simona agreed on is exactly what I proposed as well,
> > > > > > that you
> > > > > > get a private interface for exactly that use case.
> > > > > 
> > > > > A "private" interface to exchange phys_addr_t between at
> > > > > least
> > > > > VFIO/KVM/iommufd - sure no complaint with that.
> > > > > 
> > > > > Jason
> > > 
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-25 10:51                                             ` Thomas Hellström
@ 2025-09-25 11:28                                               ` Christian König
  2025-09-25 13:11                                                 ` Thomas Hellström
  2025-09-26  6:12                                                 ` Kasireddy, Vivek
  0 siblings, 2 replies; 46+ messages in thread
From: Christian König @ 2025-09-25 11:28 UTC (permalink / raw)
  To: Thomas Hellström, Kasireddy, Vivek, Jason Gunthorpe
  Cc: Brost, Matthew, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On 25.09.25 12:51, Thomas Hellström wrote:
>>> In that case I strongly suggest to add a private DMA-buf interface
>>> for the DMA-
>>> bufs exported by vfio-pci which returns which BAR and offset the
>>> DMA-buf
>>> represents.
> 
> @Christian, Is what you're referring to here the "dma_buf private
> interconnect" we've been discussing previously, now only between vfio-
> pci and any interested importers instead of private to a known exporter
> and importer?
> 
> If so I have a POC I can post as an RFC on a way to negotiate such an
> interconnect.

I was just about to write something up as well, but feel free to go ahead if you already have something.

>> Does this private dmabuf interface already exist or does it need to
>> be created
>> from the ground up?

Every driver which supports both exporting and importing DMA-buf has code to detect when somebody tries to re-import a buffer previously exported from the same device.

Now some drivers like amdgpu and I think XE as well also detect if the buffer is from another device handled by the same driver which potentially have private interconnects (XGMI or similar).

See function amdgpu_dmabuf_is_xgmi_accessible() in amdgpu_dma_buf.c for an example.

>> If it already exists, could you please share an example/reference of
>> how you
>> have used it with amdgpu or other drivers?

Well what's new is that we need to do this between two drivers unreleated to each other.

As far as I know previously that was all inside AMD drivers for example, while in this case vfio is a common vendor agnostic driver.

So we should probably make sure to get that right and vendor agnostic etc....

>> If it doesn't exist, I was wondering if it should be based on any
>> particular best
>> practices/ideas (or design patterns) that already exist in other
>> drivers?
> 
> @Vivek, another question: Also on the guest side we're exporting dma-
> mapped adresses that are imported and somehow decoded by the guest
> virtio-gpu driver? Is something similar needed there?
> 
> Also how would the guest side VF driver know that what is assumed to be
> a PF on the same device is actually a PF on the same device and not a
> completely different device with another driver? (In which case I
> assume it would like to export a system dma-buf)?

Another question is how is lifetime handled? E.g. does the guest know that a DMA-buf exists for it's BAR area?

Regards,
Christian.

> 
> Thanks,
> Thomas
> 
> 
> 
>>
>>>
>>> Ideally using the same structure Qemu used to provide the offset to
>>> the vfio-
>>> pci driver, but not a must have.
>>>
>>> This way the driver for the GPU PF (XE) can leverage this
>>> interface, validates
>>> that the DMA-buf comes from a VF it feels responsible for and do
>>> the math to
>>> figure out in which parts of the VRAM needs to be accessed to
>>> scanout the
>>> picture.
>> Sounds good. This is definitely a viable path forward and it looks
>> like we are all
>> in agreement with this idea.
>>
>> I guess we can start exploring how to implement the private dmabuf
>> interface
>> mechanism right away.
>>
>> Thanks,
>> Vivek
>>
>>>
>>> This way this private vfio-pci interface can also be used by
>>> iommufd for
>>> example.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>> Vivek
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>> What Simona agreed on is exactly what I proposed as well,
>>>>>>> that you
>>>>>>> get a private interface for exactly that use case.
>>>>>>
>>>>>> A "private" interface to exchange phys_addr_t between at
>>>>>> least
>>>>>> VFIO/KVM/iommufd - sure no complaint with that.
>>>>>>
>>>>>> Jason
>>>>
>>
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-25 11:28                                               ` Christian König
@ 2025-09-25 13:11                                                 ` Thomas Hellström
  2025-09-25 13:33                                                   ` Jason Gunthorpe
  2025-09-26  6:12                                                 ` Kasireddy, Vivek
  1 sibling, 1 reply; 46+ messages in thread
From: Thomas Hellström @ 2025-09-25 13:11 UTC (permalink / raw)
  To: Christian König, Kasireddy, Vivek, Jason Gunthorpe
  Cc: Brost, Matthew, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On Thu, 2025-09-25 at 13:28 +0200, Christian König wrote:
> On 25.09.25 12:51, Thomas Hellström wrote:
> > > > In that case I strongly suggest to add a private DMA-buf
> > > > interface
> > > > for the DMA-
> > > > bufs exported by vfio-pci which returns which BAR and offset
> > > > the
> > > > DMA-buf
> > > > represents.
> > 
> > @Christian, Is what you're referring to here the "dma_buf private
> > interconnect" we've been discussing previously, now only between
> > vfio-
> > pci and any interested importers instead of private to a known
> > exporter
> > and importer?
> > 
> > If so I have a POC I can post as an RFC on a way to negotiate such
> > an
> > interconnect.
> 
> I was just about to write something up as well, but feel free to go
> ahead if you already have something.

Just posted a POC. It might be that you have better ideas, though.


/Thomas


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-25 13:11                                                 ` Thomas Hellström
@ 2025-09-25 13:33                                                   ` Jason Gunthorpe
  2025-09-25 15:40                                                     ` Thomas Hellström
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-25 13:33 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Christian König, Kasireddy, Vivek, Brost, Matthew,
	Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On Thu, Sep 25, 2025 at 03:11:50PM +0200, Thomas Hellström wrote:
> On Thu, 2025-09-25 at 13:28 +0200, Christian König wrote:
> > On 25.09.25 12:51, Thomas Hellström wrote:
> > > > > In that case I strongly suggest to add a private DMA-buf
> > > > > interface
> > > > > for the DMA-
> > > > > bufs exported by vfio-pci which returns which BAR and offset
> > > > > the
> > > > > DMA-buf
> > > > > represents.
> > > 
> > > @Christian, Is what you're referring to here the "dma_buf private
> > > interconnect" we've been discussing previously, now only between
> > > vfio-
> > > pci and any interested importers instead of private to a known
> > > exporter
> > > and importer?
> > > 
> > > If so I have a POC I can post as an RFC on a way to negotiate such
> > > an
> > > interconnect.
> > 
> > I was just about to write something up as well, but feel free to go
> > ahead if you already have something.
> 
> Just posted a POC. It might be that you have better ideas, though.

I think is also needs an API that is not based on scatterlist. Please
lets not push a private interconnect address through the scatterlist
dma_addr_t!

Assuming that you imagine we'd define some global well known
interconnect

'struct blah pci_bar_interconnect {..}'

And if that is negotiated then the non-scatterlist communication would
give the (struct pci_dev *, bar index, bar offset) list?

I think this could solve the kvm/iommufd problems at least!

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-25 13:33                                                   ` Jason Gunthorpe
@ 2025-09-25 15:40                                                     ` Thomas Hellström
  2025-09-25 15:55                                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Hellström @ 2025-09-25 15:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian König, Kasireddy, Vivek, Brost, Matthew,
	Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On Thu, 2025-09-25 at 10:33 -0300, Jason Gunthorpe wrote:
> On Thu, Sep 25, 2025 at 03:11:50PM +0200, Thomas Hellström wrote:
> > On Thu, 2025-09-25 at 13:28 +0200, Christian König wrote:
> > > On 25.09.25 12:51, Thomas Hellström wrote:
> > > > > > In that case I strongly suggest to add a private DMA-buf
> > > > > > interface
> > > > > > for the DMA-
> > > > > > bufs exported by vfio-pci which returns which BAR and
> > > > > > offset
> > > > > > the
> > > > > > DMA-buf
> > > > > > represents.
> > > > 
> > > > @Christian, Is what you're referring to here the "dma_buf
> > > > private
> > > > interconnect" we've been discussing previously, now only
> > > > between
> > > > vfio-
> > > > pci and any interested importers instead of private to a known
> > > > exporter
> > > > and importer?
> > > > 
> > > > If so I have a POC I can post as an RFC on a way to negotiate
> > > > such
> > > > an
> > > > interconnect.
> > > 
> > > I was just about to write something up as well, but feel free to
> > > go
> > > ahead if you already have something.
> > 
> > Just posted a POC. It might be that you have better ideas, though.
> 
> I think is also needs an API that is not based on scatterlist. Please
> lets not push a private interconnect address through the scatterlist
> dma_addr_t!

I think that needs to be defined per interconnect, choosing a data
structure that suits best. Although I find it reasonable to mandate
dma_addr_t or scatterlists to *not* be used.

This merely focuses on the interconnect negotiation itself.

> 
> Assuming that you imagine we'd define some global well known
> interconnect
> 
> 'struct blah pci_bar_interconnect {..}'
> 
> And if that is negotiated then the non-scatterlist communication
> would
> give the (struct pci_dev *, bar index, bar offset) list?

Yes something like that. Although I think perhaps the dev + bar index
might be part of the negotiation, so that it is rejected if the
importer feels that there is no implied PF + VF interconnect. Then the
list would be reduced to only the offset.

Still I think Vivek would be better to figure the exact negotiation and
data structure out.
 
/Thomas

> 
> I think this could solve the kvm/iommufd problems at least!
> 
> Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-25 15:40                                                     ` Thomas Hellström
@ 2025-09-25 15:55                                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-25 15:55 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Christian König, Kasireddy, Vivek, Brost, Matthew,
	Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

On Thu, Sep 25, 2025 at 05:40:25PM +0200, Thomas Hellström wrote:
> > I think is also needs an API that is not based on scatterlist. Please
> > lets not push a private interconnect address through the scatterlist
> > dma_addr_t!
> 
> I think that needs to be defined per interconnect, choosing a data
> structure that suits best. Although I find it reasonable to mandate
> dma_addr_t or scatterlists to *not* be used.

Can you include some sketch of how that would look? And cc me please
on future versions :)

> > Assuming that you imagine we'd define some global well known
> > interconnect
> > 
> > 'struct blah pci_bar_interconnect {..}'
> > 
> > And if that is negotiated then the non-scatterlist communication
> > would
> > give the (struct pci_dev *, bar index, bar offset) list?
> 
> Yes something like that. Although I think perhaps the dev + bar index
> might be part of the negotiation, so that it is rejected if the
> importer feels that there is no implied PF + VF interconnect. Then the
> list would be reduced to only the offset.

I'm also happy if the list is just a list of bar offsets for a single
bar and the pci_dev/bar index is discovered by the importer before
getting the list.

That's probably a better API design anyhow since building the list
just to check the pci_dev is wasteful.

In this case a simple vmap of bar offset / len pairs would be nice and
easy place to start.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-25 11:28                                               ` Christian König
  2025-09-25 13:11                                                 ` Thomas Hellström
@ 2025-09-26  6:12                                                 ` Kasireddy, Vivek
  1 sibling, 0 replies; 46+ messages in thread
From: Kasireddy, Vivek @ 2025-09-26  6:12 UTC (permalink / raw)
  To: Christian König, Thomas Hellström, Jason Gunthorpe
  Cc: Brost, Matthew, Simona Vetter, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Bjorn Helgaas, Logan Gunthorpe,
	linux-pci@vger.kernel.org

Hi Christian, Thomas,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
> 
> On 25.09.25 12:51, Thomas Hellström wrote:
> >>> In that case I strongly suggest to add a private DMA-buf interface
> >>> for the DMA-
> >>> bufs exported by vfio-pci which returns which BAR and offset the
> >>> DMA-buf
> >>> represents.
> >
> > @Christian, Is what you're referring to here the "dma_buf private
> > interconnect" we've been discussing previously, now only between vfio-
> > pci and any interested importers instead of private to a known exporter
> > and importer?
> >
> > If so I have a POC I can post as an RFC on a way to negotiate such an
> > interconnect.
I'll start testing with the RFC patches Thomas posted and see how they can
be improved to make them suitable not only for this use-case but also for
the other (iommufd/kvm) use-cases as well.

> 
> I was just about to write something up as well, but feel free to go ahead if
> you already have something.
> 
> >> Does this private dmabuf interface already exist or does it need to
> >> be created
> >> from the ground up?
> 
> Every driver which supports both exporting and importing DMA-buf has
> code to detect when somebody tries to re-import a buffer previously
> exported from the same device.
> 
> Now some drivers like amdgpu and I think XE as well also detect if the buffer
> is from another device handled by the same driver which potentially have
> private interconnects (XGMI or similar).
> 
> See function amdgpu_dmabuf_is_xgmi_accessible() in amdgpu_dma_buf.c
> for an example.
> 
> >> If it already exists, could you please share an example/reference of
> >> how you
> >> have used it with amdgpu or other drivers?
> 
> Well what's new is that we need to do this between two drivers unreleated
> to each other.
Right, that is a key difference.

> 
> As far as I know previously that was all inside AMD drivers for example,
> while in this case vfio is a common vendor agnostic driver.
> 
> So we should probably make sure to get that right and vendor agnostic
> etc....
> 
> >> If it doesn't exist, I was wondering if it should be based on any
> >> particular best
> >> practices/ideas (or design patterns) that already exist in other
> >> drivers?
> >
> > @Vivek, another question: Also on the guest side we're exporting dma-
> > mapped adresses that are imported and somehow decoded by the guest
> > virtio-gpu driver? Is something similar needed there?
AFAICS, nothing else is needed because Qemu is the one that decodes or
resolves the dma-mapped addresses (that are imported by virtio-gpu) and
identifies the right memory region (and its owner, which could be a vfio-dev
or system memory). Details are found in the last patch of this Qemu series:
https://lore.kernel.org/qemu-devel/20250903054438.1179384-1-vivek.kasireddy@intel.com/

> >
> > Also how would the guest side VF driver know that what is assumed to be
> > a PF on the same device is actually a PF on the same device and not a
> > completely different device with another driver? (In which case I
> > assume it would like to export a system dma-buf)?
Good question. AFAICS, there is no definitive way for the Xe VF driver to
know who is the ultimate consumer of its buffer on the Host side. In other
words, the real question is how should it decide whether to create the
dmabuf from VRAM or migrate the backing object to system memory and
then create the dmabuf. Here are a few options I have tried so far:
1) If the importer (virtio-gpu) has allow_peer2peer set to true, and if Xe
is running in VF mode, then assume that PF of the same device is active
on the Host side and thus create a dmabuf from VRAM.

2) Rely on the user (or admin) that is launching Qemu to determine if the PF
on the Host and the VF are compatible (same device) and therefore configure
virtio-gpu and the VF device to be virtual P2P peers like this:
   qemu-system-x86_64 -m 4096m ....
  -device ioh3420,id=root_port1,bus=pcie.0
  -device x3130-upstream,id=upstream1,bus=root_port1
  -device xio3130-downstream,id=downstream1,bus=upstream1,chassis=9
  -device xio3130-downstream,id=downstream2,bus=upstream1,chassis=10
  -device vfio-pci,host=0000:03:00.1,bus=downstream1
  -device virtio-gpu,max_outputs=1,blob=true,xres=1920,yres=1080,bus=downstream2
  -display gtk,gl=on

I am sure there may be better ideas but I think the first option above is a lot
more straightforward. However, currently, virtio-gpu's allow_peer2peer is
always set to true but I'd like to set it to false and add a Qemu option to toggle 
it while launching the VM. This way the user gets to decide (based on what
GPU device is active on the Host) whether the Xe VF driver can create the
dmabuf from system memory or VRAM.

> 
> Another question is how is lifetime handled? E.g. does the guest know that
> a DMA-buf exists for it's BAR area?
Yes, the Guest VM knows that. The virtio-gpu driver (a dynamic importer)
which imports the scanout buffer from Xe VF driver calls dma_buf_pin().
So, the backing object stays pinned until Host/Qemu signals (via a fence)
that it is done accessing (or using) the Guest's buffer.

Also, note that since virtio-gpu registers a move_notify() callback, it can
let Qemu know of any location changes associated with the backing store
of the imported scanout buffer by sending attach_backing and
detach_backing cmds.

Thanks,
Vivek

> 
> Regards,
> Christian.
> 
> >
> > Thanks,
> > Thomas
> >
> >
> >
> >>
> >>>
> >>> Ideally using the same structure Qemu used to provide the offset to
> >>> the vfio-
> >>> pci driver, but not a must have.
> >>>
> >>> This way the driver for the GPU PF (XE) can leverage this
> >>> interface, validates
> >>> that the DMA-buf comes from a VF it feels responsible for and do
> >>> the math to
> >>> figure out in which parts of the VRAM needs to be accessed to
> >>> scanout the
> >>> picture.
> >> Sounds good. This is definitely a viable path forward and it looks
> >> like we are all
> >> in agreement with this idea.
> >>
> >> I guess we can start exploring how to implement the private dmabuf
> >> interface
> >> mechanism right away.
> >>
> >> Thanks,
> >> Vivek
> >>
> >>>
> >>> This way this private vfio-pci interface can also be used by
> >>> iommufd for
> >>> example.
> >>>
> >>> Regards,
> >>> Christian.
> >>>
> >>>>
> >>>> Thanks,
> >>>> Vivek
> >>>>
> >>>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>>>>
> >>>>>>> What Simona agreed on is exactly what I proposed as well,
> >>>>>>> that you
> >>>>>>> get a private interface for exactly that use case.
> >>>>>>
> >>>>>> A "private" interface to exchange phys_addr_t between at
> >>>>>> least
> >>>>>> VFIO/KVM/iommufd - sure no complaint with that.
> >>>>>>
> >>>>>> Jason
> >>>>
> >>
> >


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-23 12:45                             ` Christian König
  2025-09-23 13:12                               ` Jason Gunthorpe
@ 2025-09-23 13:36                               ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2025-09-23 13:36 UTC (permalink / raw)
  To: Christian König
  Cc: Jason Gunthorpe, Matthew Brost, Kasireddy, Vivek, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström

On Tue, Sep 23, 2025 at 02:45:10PM +0200, Christian König wrote:
> On 23.09.25 14:15, Jason Gunthorpe wrote:
> > On Tue, Sep 23, 2025 at 09:52:04AM +0200, Christian König wrote:
> >> For example the ISP driver part of amdgpu provides the V4L2
> >> interface and when we interchange a DMA-buf with it we recognize that
> >> it is actually the same device we work with.
> > 
> > One of the issues here is the mis-use of dma_map_resource() to create
> > dma_addr_t for PCI devices. This was never correct.
> 
> That is not a mis-use at all but rather exactly what dma_map_resource() was created for.

It isn't.  dma_map_resource was not created for PCIe P2P, and does not
work for the general case of PCIe P2P including offsets and switches.
Using it always was a big, and the drm driver maitaintainers were
constanty reminded of that and chose to ignore it with passion.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22 11:22               ` Christian König
  2025-09-22 12:20                 ` Jason Gunthorpe
@ 2025-09-23  6:01                 ` Kasireddy, Vivek
  1 sibling, 0 replies; 46+ messages in thread
From: Kasireddy, Vivek @ 2025-09-23  6:01 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe, Simona Vetter
  Cc: dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org,
	Thomas Hellström, Brost, Matthew

Hi Christian,

> 
> Hi guys,
> 
> On 22.09.25 08:59, Kasireddy, Vivek wrote:
> > Hi Jason,
> >
> >> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for
> device
> >> functions of Intel GPUs
> >>
> >> On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:
> >>>> In this case messing with ACS is completely wrong. If the intention is
> >>>> to convay a some kind of "private" address representing the physical
> >>>> VRAM then you need to use a DMABUF mechanism to do that, not
> >> deliver a
> >>>> P2P address that the other side cannot access.
> >>
> >>> I think using a PCI BAR Address works just fine in this case because the
> Xe
> >>> driver bound to PF on the Host can easily determine that it belongs to
> one
> >>> of the VFs and translate it into VRAM Address.
> >>
> >> That isn't how the P2P or ACS mechansim works in Linux, it is about
> >> the actual address used for DMA.
> > Right, but this is not dealing with P2P DMA access between two random,
> > unrelated devices. Instead, this is a special situation involving a GPU PF
> > trying to access the VRAM of a VF that it provisioned and holds a
> reference
> > on (note that the backing object for VF's VRAM is pinned by Xe on Host
> > as part of resource provisioning). But it gets treated as regular P2P DMA
> > because the exporters rely on pci_p2pdma_distance() or
> > pci_p2pdma_map_type() to determine P2P compatibility.
> >
> > In other words, I am trying to look at this problem differently: how can the
> > PF be allowed to access the VF's resource that it provisioned, particularly
> > when the VF itself requests the PF to access it and when a hardware path
> > (via PCIe fabric) is not required/supported or doesn't exist at all?
> 
> Well what exactly is happening here? You have a PF assigned to the host
> and a VF passed through to a guest, correct?
Yes, correct.

> 
> And now the PF (from the host side) wants to access a BAR of the VF?
Yes, that is indeed the use-case, except that the PF cannot access a buffer
located in VF's VRAM portion via the BAR because this path is likely not
supported by our hardware. Therefore, my proposal (via this patch series)
is to translate the BAR addresses into VRAM addresses in Xe driver (on the Host).

Here are some more details about the use-case (copied from an earlier reply
to Jason):
- Xe Graphics driver, bound to GPU PF on the Host provisions its resources
including VRAM among all the VFs.

- A GPU VF device is bound to vfio-pci and assigned to a Linux VM which
is launched via Qemu.

- The Xe Graphics driver running inside the Linux VM creates a buffer
(Gnome Wayland compositor's framebuffer) in the VF's portion (or share)
of the VRAM and this buffer is shared with Qemu. Qemu then requests
vfio-pci driver to create a dmabuf associated with this buffer.

- Next, Qemu (UI layer) requests the GPU PF (via the Xe driver) to import
the dmabuf (for display purposes) located in VF's portion of the VRAM. This
is where two problems occur: 

1) The exporter (vfio-pci driver in this case) calls pci_p2pdma_map_type()
to determine the mapping type (or check P2P compatibility) between both
devices (GPU VF and PF) but it fails due to the ACS enforcement check because
the PCIe upstream bridge is not whitelisted, which is a common problem on
workstations/desktops/laptops.

2) Assuming that pci_p2pdma_map_type() did not fail (likely on server systems
with whitelisted PCIe bridges), based on my experiments, the GPU PF is unable
to access the buffer located in VF's VRAM portion directly because it is represented
using PCI BAR addresses. (note that, the PCI BAR address is the DMA address
here which seems to be a common practice among GPU drivers including Xe and
Amdgpu when exporting dmabufs to other devices).

The only way this seems to be work at the moment is if the BAR addresses
are translated into VRAM addresses that the GPU PF understands (this is done
done inside Xe driver on the Host using provisioning data). Note that this buffer
is accessible by the CPU using BAR addresses but it is very slow.

So, in summary, given that the GPU PF does not need to use PCIe fabric in
order to access the buffer located in GPU VF's portion of the VRAM in this
use-case, I figured adding a quirk (to not enforce ACS check) would solve 1)
and implementing the BAR to VRAM address translation in Xe driver on the
Host would solve 2) above.

Also, Jason suggested that using dmabuf private address mechanism would
help with my use-case. Could you please share details about how it can be
used here?

Thanks,
Vivek

> 
> Regards,
> Christian.
> 
> >
> > Furthermore, note that on a server system with a whitelisted PCIe
> upstream
> > bridge, this quirk would not be needed at all as pci_p2pdma_map_type()
> > would not have failed and this would have been a purely Xe driver specific
> > problem to solve that would have required just the translation logic and
> no
> > further changes anywhere. But my goal is to fix it across systems like
> > workstations/desktops that do not typically have whitelisted PCIe
> upstream
> > bridges.
> >
> >>
> >> You can't translate a dma_addr_t to anything in the Xe PF driver
> >> anyhow, once it goes through the IOMMU the necessary information is
> lost.
> > Well, I already tested this path (via IOMMU, with your earlier vfio-pci +
> > dmabuf patch that used dma_map_resource() and also with Leon's latest
> > version) and found that I could still do the translation in the Xe PF driver
> > after first calling iommu_iova_to_phys().
> >
> >> This is a fundamentally broken design to dma map something and
> >> then try to reverse engineer the dma_addr_t back to something with
> >> meaning.
> > IIUC, I don't think this is a new or radical idea. I think the concept is
> slightly
> > similar to using bounce buffers to address hardware DMA limitations
> except
> > that there are no memory copies and the CPU is not involved. And, I don't
> see
> > any other way to do this because I don't believe the exporter can provide
> a
> > DMA address that the importer can use directly without any translation,
> which
> > seems unavoidable in this case.
> >
> >>
> >>>> Christian told me dmabuf has such a private address mechanism, so
> >>>> please figure out a way to use it..
> >>>
> >>> Even if such as a mechanism exists, we still need a way to prevent
> >>> pci_p2pdma_map_type() from failing when invoked by the exporter
> (vfio-
> >> pci).
> >>> Does it make sense to move this quirk into the exporter?
> >>
> >> When you export a private address through dmabuf the VFIO exporter
> >> will not call p2pdma paths when generating it.
> > I have cc'd Christian and Simona. Hopefully, they can help explain how
> > the dmabuf private address mechanism can be used to address my
> > use-case. And, I sincerely hope that it will work, otherwise I don't see
> > any viable path forward for what I am trying to do other than using this
> > quirk and translation. Note that the main reason why I am doing this
> > is because I am seeing at-least ~35% performance gain when running
> > light 3D/Gfx workloads.
> >
> >>
> >>> Also, AFAICS, translating BAR Address to VRAM Address can only be
> >>> done by the Xe driver bound to PF because it has access to provisioning
> >>> data. In other words, vfio-pci would not be able to share any other
> >>> address other than the BAR Address because it wouldn't know how to
> >>> translate it to VRAM Address.
> >>
> >> If you have a vfio varient driver then the VF vfio driver could call
> >> the Xe driver to create a suitable dmabuf using the private
> >> addressing. This is probably what is required here if this is what you
> >> are trying to do.
> > Could this not be done via the vendor agnostic vfio-pci (+ dmabuf) driver
> > instead of having to use a separate VF/vfio variant driver?
> >
> >>
> >>>> No, don't, it is completely wrong to mess with ACS flags for the
> >>>> problem you are trying to solve.
> >>
> >>> But I am not messing with any ACS flags here. I am just adding a quirk to
> >>> sidestep the ACS enforcement check given that the PF to VF access does
> >>> not involve the PCIe fabric in this case.
> >>
> >> Which is completely wrong. These are all based on fabric capability,
> >> not based on code in drivers to wrongly "translate" the dma_addr_t.
> > I am not sure why you consider translation to be wrong in this case
> > given that it is done by a trusted entity (Xe PF driver) that is bound to
> > the GPU PF and provisioned the resource that it is trying to access.
> > What limitations do you see with this approach?
> >
> > Also, the quirk being added in this patch is indeed meant to address a
> > specific case (GPU PF to VF access) to workaround a potential hardware
> > limitation (non-existence of a direct PF to VF DMA access path via the
> > PCIe fabric). Isn't that one of the main ideas behind using quirks -- to
> > address hardware limitations?
> >
> > Thanks,
> > Vivek
> >
> >>
> >> Jason


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-22  6:59             ` Kasireddy, Vivek
  2025-09-22 11:22               ` Christian König
@ 2025-09-22 12:12               ` Jason Gunthorpe
  1 sibling, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-22 12:12 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: Christian König, Simona Vetter,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Bjorn Helgaas, Logan Gunthorpe, linux-pci@vger.kernel.org

On Mon, Sep 22, 2025 at 06:59:10AM +0000, Kasireddy, Vivek wrote:

> > You can't translate a dma_addr_t to anything in the Xe PF driver
> > anyhow, once it goes through the IOMMU the necessary information is lost.
> Well, I already tested this path (via IOMMU, with your earlier vfio-pci +
> dmabuf patch that used dma_map_resource() and also with Leon's latest
> version) and found that I could still do the translation in the Xe PF driver
> after first calling iommu_iova_to_phys().

I would NAK any driver doing something so hacky. You are approaching
this completely wrong to abuse things like this.

> that there are no memory copies and the CPU is not involved. And, I don't see
> any other way to do this because I don't believe the exporter can provide a
> DMA address that the importer can use directly without any translation, which
> seems unavoidable in this case.

This is sadly how dmabuf is designed, please talk to Christain to
figure out how to exchage the right kind of address.

> use-case. And, I sincerely hope that it will work, otherwise I don't see
> any viable path forward for what I am trying to do other than using this
> quirk and translation. 

You will not get a quirk.

> PCIe fabric). Isn't that one of the main ideas behind using quirks -- to
> address hardware limitations?

This is not a hardware limitation. You are using DMABUF wrong, writing
hacks into the PF driver and then trying to make up a fake ACS quirk
:(

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-19  6:22         ` Kasireddy, Vivek
  2025-09-19 12:29           ` Jason Gunthorpe
@ 2025-09-24 16:13           ` Simon Richter
  2025-09-24 17:12             ` Jason Gunthorpe
  2025-09-25  4:06             ` Kasireddy, Vivek
  1 sibling, 2 replies; 46+ messages in thread
From: Simon Richter @ 2025-09-24 16:13 UTC (permalink / raw)
  To: Kasireddy, Vivek, Jason Gunthorpe, Christian König,
	Matthew Brost, Christoph Hellwig, dri-devel, intel-xe, linux-pci

Hi,

since I'm late to the party I'll reply to the entire thread in one go.

On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:

> I think using a PCI BAR Address works just fine in this case because the Xe
> driver bound to PF on the Host can easily determine that it belongs to one
> of the VFs and translate it into VRAM Address.

There are PCIe bridges that support address translation, and might apply
different translations for different PASIDs, so this determination would
need to walk the device tree on both guest and host in a way that does
not confer trust to the guest or allows it to gain access to resources
through race conditions.

The difficulty here is that you are building a communication mechanism
that bypasses a trust boundary in the virtualization framework, so it
becomes part of the virtualization framework. I believe we can avoid
that to some extent by exchanging handles instead of raw pointers.

I can see the point in using the dmabuf API, because it integrates well
with existing 3D APIs in userspace, although I don't quite understand
what the VK_KHR_external_memory_dma_buf extension actually does, besides
defining a flag bit -- it seems the heavy lifting is done by the
VK_KHR_external_memory_fd extension anyway. But yes, we probably want
the interface to be compatible to existing sharing APIs on the host side
at least, to allow the guest's "on-screen" images to be easily imported.

There is some potential for a shortcut here as well, giving these
buffers directly to the host's desktop compositor instead of having an
application react to updates by copying the data from the area shared
with the VF to the area shared between the application and the
compositor -- that would also be a reason to remain close to the
existing interface.

It's not entirely necessary for this interface to be a dma_buf, as long
as we have a conversion between a file descriptor and a BO.  On the
other hand, it may be desirable to allow re-exporting it as a dma_buf if
we want to access it from another device as well.

I'm not sure that is a likely use case though, even the horrible
contraption I'm building here that has a Thunderbolt device send data
directly to VRAM does not require that, because the guest would process
the data and then send a different buffer to the host. Still would be
nice for completeness.

The other thing that seems to be looming on the horizon is that dma_buf
is too limited for VRAM buffers, because once it's imported, it is
pinned as well, but we'd like to keep it moveable (there was another
thread on the xe mailing list about that). That might even be more
important if we have limited BAR space, because then we might not want
to make the memory accessible through the BAR unless imported by
something that needs access through the BAR, which we've established the
main use case doesn't (because it doesn't even need any kind of access).

I think passing objects between trust domains should take the form of an
opaque handle that is not predictable, and refers to an internal data
structure with the actual parameters (so we pass these internally as
well, and avoid all the awkwardness of host and guest having different
world views. It doesn't matter if that path is slow, it should only be
used rather seldom (at VM start and when the VM changes screen
resolution).

For VM startup, we probably want to provision guest "on-screen" memory
and semaphores really early -- maybe it makes sense to just give each VF
a sensible shared mapping like 16 MB (rounded up from 2*1080p*32bit) by
default, and/or present a ROM with EFI and OpenFirmware drivers -- can
VFs do that on current hardware?

On Tue, Sep 23, 2025 at 05:53:06AM +0000, Kasireddy, Vivek wrote:

> IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
> to never expose VRAM Addresses and instead have BAR addresses as DMA
> addresses when exporting dmabufs to other devices.

Yes, because that is how the other devices access that memory.

> The problem here is that the CPU physical (aka BAR Address) is only
> usable by the CPU.

The address you receive from mapping a dma_buf for a particular device
is not a CPU physical address, even if it is identical on pretty much
all PC hardware because it is uncommon to configure the root bridge with
a translation there.

On my POWER9 machine, the situation is a bit different: a range in the
lower 4 GB is reserved for 32-bit BARs, the memory with those physical
addresses is remapped so it appears after the end of physical RAM from
the point of view of PCIe devices, and the 32 bit BARs appear at the
base of the PCIe bus (after the legacy ports).

So, as an example (reality is a bit more complex :> ) the memory map
might look like

0000000000000000..0000001fffffffff    RAM
0060000000000000..006001ffffffffff    PCIe domain 1
0060020000000000..006003ffffffffff    PCIe domain 2
...

and the phys_addr_t I get on the CPU refers to this mapping. However, a
device attached to PCIe domain 1 would see

0000000000000000..000000000000ffff    Legacy I/O in PCIe domain 1
0000000000010000..00000000000fffff    Legacy VGA mappings
0000000000100000..000000007fffffff    32-bit BARs in PCIe domain 1
0000000080000000..00000000ffffffff    RAM (accessible to 32 bit devices)
0000000100000000..0000001fffffffff    RAM (requires 64 bit addressing)
0000002000000000..000000207fffffff    RAM (CPU physical address 0..2GB)
0060000080000000..006001ffffffffff    64-bit BARs in PCIe domain 1
0060020000000000..006003ffffffffff    PCIe domain 2

This allows 32 bit devices to access other 32 bit devices on the same
bus, and (some) physical memory, but we need to sacrifice the 1:1
mapping for host memory. The actual mapping is a bit more complex,
because 64 bit BARs get mapped into the "32 bit" space to keep them
accessible for 32 bit cards in the same domain, and this would also be a
valid reason not to extend the BAR size even if we can.

The default 256 MB aperture ends up in the "32 bit" range, so unless the
BAR is resized and reallocated, the CPU and DMA addresses for the
aperture *will* differ.

So when a DMA buffer is created that ends up in the first 2 GB of RAM,
the dma_addr_t returned for this device will have 0x2000000000 added to
it, because that is the address that the device will have to use, and
DMA buffers for 32 bit devices will be taken from the 2GB..4GB range
because neither the first 2 GB nor anything beyond 4 GB are accessible
to this device.

If there is a 32 bit BAR at 0x10000000 in domain 1, then the CPU will
see it at 0x60000010000000, but mapping it from another device in the
same domain will return a dma_addr_t of 0x10000000 -- because that is
the address that is routeable in the PCIe fabric, this is the BAR
address configured into the device so it will actually respond, and the
TLP will not leave the bus because it is downstream of the root bridge,
so it does not affect the physical RAM.

Actual numbers will be different to handle even more corner cases and I
don't remember exactly how many zeroes are in each range, but you get
the idea -- and this is before we've even started creating virtual
machines with a different view of physical addresses.

On Tue, Sep 23, 2025 at 06:01:34AM +0000, Kasireddy, Vivek wrote:

> - The Xe Graphics driver running inside the Linux VM creates a buffer
> (Gnome Wayland compositor's framebuffer) in the VF's portion (or share)
> of the VRAM and this buffer is shared with Qemu. Qemu then requests
> vfio-pci driver to create a dmabuf associated with this buffer.

That's a bit late. What is EFI supposed to do?

   Simon

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-24 16:13           ` Simon Richter
@ 2025-09-24 17:12             ` Jason Gunthorpe
  2025-09-25  4:06             ` Kasireddy, Vivek
  1 sibling, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-09-24 17:12 UTC (permalink / raw)
  To: Simon Richter
  Cc: Kasireddy, Vivek, Christian König, Matthew Brost,
	Christoph Hellwig, dri-devel, intel-xe, linux-pci

On Wed, Sep 24, 2025 at 06:13:56PM +0200, Simon Richter wrote:
> > The problem here is that the CPU physical (aka BAR Address) is only
> > usable by the CPU.
> 
> The address you receive from mapping a dma_buf for a particular device
> is not a CPU physical address, even if it is identical on pretty much
> all PC hardware because it is uncommon to configure the root bridge with
> a translation there.

I said already, you cannot convert from a dma_addr_t back to a
phys_addr_t. There is just no universal API for this and your examples
like PPC explain why it cannot work even if some hacks appear to be
okay on one x86 system.

I think Christian's suggestion is to pass a
(struct pci_dev *, bar index, bar offset) between dmabuf
exporter/importer

That would work for alot of use case, we could make iommufd and kvm
work with that.

The Xe PF driver could detect the vram by checking if the pci_dev * is
a VF of itself and then using its internal knowledge of the
provisioned VF BAR to compute the VRAM location.

If it is not this case then it would fall back to the normal 'exporter
does dma map' fllow.

Jason

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs
  2025-09-24 16:13           ` Simon Richter
  2025-09-24 17:12             ` Jason Gunthorpe
@ 2025-09-25  4:06             ` Kasireddy, Vivek
  1 sibling, 0 replies; 46+ messages in thread
From: Kasireddy, Vivek @ 2025-09-25  4:06 UTC (permalink / raw)
  To: Simon Richter, Jason Gunthorpe, Christian König,
	Brost, Matthew, Christoph Hellwig,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	linux-pci@vger.kernel.org

Hi Simon,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
> 
> Hi,
> 
> since I'm late to the party I'll reply to the entire thread in one go.
> 
> On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:
> 
> > I think using a PCI BAR Address works just fine in this case because the Xe
> > driver bound to PF on the Host can easily determine that it belongs to one
> > of the VFs and translate it into VRAM Address.
> 
> There are PCIe bridges that support address translation, and might apply
> different translations for different PASIDs, so this determination would
> need to walk the device tree on both guest and host in a way that does
> not confer trust to the guest or allows it to gain access to resources
> through race conditions.
> 
> The difficulty here is that you are building a communication mechanism
> that bypasses a trust boundary in the virtualization framework, so it
> becomes part of the virtualization framework. I believe we can avoid
> that to some extent by exchanging handles instead of raw pointers.
> 
> I can see the point in using the dmabuf API, because it integrates well
> with existing 3D APIs in userspace, although I don't quite understand
> what the VK_KHR_external_memory_dma_buf extension actually does,
> besides
> defining a flag bit -- it seems the heavy lifting is done by the
> VK_KHR_external_memory_fd extension anyway. But yes, we probably want
> the interface to be compatible to existing sharing APIs on the host side
> at least, to allow the guest's "on-screen" images to be easily imported.
> 
> There is some potential for a shortcut here as well, giving these
> buffers directly to the host's desktop compositor instead of having an
> application react to updates by copying the data from the area shared
> with the VF to the area shared between the application and the
> compositor -- that would also be a reason to remain close to the
> existing interface.
> 
> It's not entirely necessary for this interface to be a dma_buf, as long
> as we have a conversion between a file descriptor and a BO.  On the
> other hand, it may be desirable to allow re-exporting it as a dma_buf if
> we want to access it from another device as well.
> 
> I'm not sure that is a likely use case though, even the horrible
> contraption I'm building here that has a Thunderbolt device send data
> directly to VRAM does not require that, because the guest would process
> the data and then send a different buffer to the host. Still would be
> nice for completeness.
> 
> The other thing that seems to be looming on the horizon is that dma_buf
> is too limited for VRAM buffers, because once it's imported, it is
> pinned as well, but we'd like to keep it moveable (there was another
> thread on the xe mailing list about that). That might even be more
> important if we have limited BAR space, because then we might not want
> to make the memory accessible through the BAR unless imported by
> something that needs access through the BAR, which we've established the
> main use case doesn't (because it doesn't even need any kind of access).
> 
> I think passing objects between trust domains should take the form of an
> opaque handle that is not predictable, and refers to an internal data
> structure with the actual parameters (so we pass these internally as
> well, and avoid all the awkwardness of host and guest having different
> world views. It doesn't matter if that path is slow, it should only be
> used rather seldom (at VM start and when the VM changes screen
> resolution).
> 
> For VM startup, we probably want to provision guest "on-screen" memory
> and semaphores really early -- maybe it makes sense to just give each VF
> a sensible shared mapping like 16 MB (rounded up from 2*1080p*32bit) by
> default, and/or present a ROM with EFI and OpenFirmware drivers -- can
> VFs do that on current hardware?
> 
> On Tue, Sep 23, 2025 at 05:53:06AM +0000, Kasireddy, Vivek wrote:
> 
> > IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
> > to never expose VRAM Addresses and instead have BAR addresses as DMA
> > addresses when exporting dmabufs to other devices.
> 
> Yes, because that is how the other devices access that memory.
> 
> > The problem here is that the CPU physical (aka BAR Address) is only
> > usable by the CPU.
> 
> The address you receive from mapping a dma_buf for a particular device
> is not a CPU physical address, even if it is identical on pretty much
> all PC hardware because it is uncommon to configure the root bridge with
> a translation there.
> 
> On my POWER9 machine, the situation is a bit different: a range in the
> lower 4 GB is reserved for 32-bit BARs, the memory with those physical
> addresses is remapped so it appears after the end of physical RAM from
> the point of view of PCIe devices, and the 32 bit BARs appear at the
> base of the PCIe bus (after the legacy ports).
> 
> So, as an example (reality is a bit more complex :> ) the memory map
> might look like
> 
> 0000000000000000..0000001fffffffff    RAM
> 0060000000000000..006001ffffffffff    PCIe domain 1
> 0060020000000000..006003ffffffffff    PCIe domain 2
> ...
> 
> and the phys_addr_t I get on the CPU refers to this mapping. However, a
> device attached to PCIe domain 1 would see
> 
> 0000000000000000..000000000000ffff    Legacy I/O in PCIe domain 1
> 0000000000010000..00000000000fffff    Legacy VGA mappings
> 0000000000100000..000000007fffffff    32-bit BARs in PCIe domain 1
> 0000000080000000..00000000ffffffff    RAM (accessible to 32 bit devices)
> 0000000100000000..0000001fffffffff    RAM (requires 64 bit addressing)
> 0000002000000000..000000207fffffff    RAM (CPU physical address 0..2GB)
> 0060000080000000..006001ffffffffff    64-bit BARs in PCIe domain 1
> 0060020000000000..006003ffffffffff    PCIe domain 2
> 
> This allows 32 bit devices to access other 32 bit devices on the same
> bus, and (some) physical memory, but we need to sacrifice the 1:1
> mapping for host memory. The actual mapping is a bit more complex,
> because 64 bit BARs get mapped into the "32 bit" space to keep them
> accessible for 32 bit cards in the same domain, and this would also be a
> valid reason not to extend the BAR size even if we can.
> 
> The default 256 MB aperture ends up in the "32 bit" range, so unless the
> BAR is resized and reallocated, the CPU and DMA addresses for the
> aperture *will* differ.
> 
> So when a DMA buffer is created that ends up in the first 2 GB of RAM,
> the dma_addr_t returned for this device will have 0x2000000000 added to
> it, because that is the address that the device will have to use, and
> DMA buffers for 32 bit devices will be taken from the 2GB..4GB range
> because neither the first 2 GB nor anything beyond 4 GB are accessible
> to this device.
> 
> If there is a 32 bit BAR at 0x10000000 in domain 1, then the CPU will
> see it at 0x60000010000000, but mapping it from another device in the
> same domain will return a dma_addr_t of 0x10000000 -- because that is
> the address that is routeable in the PCIe fabric, this is the BAR
> address configured into the device so it will actually respond, and the
> TLP will not leave the bus because it is downstream of the root bridge,
> so it does not affect the physical RAM.
> 
> Actual numbers will be different to handle even more corner cases and I
> don't remember exactly how many zeroes are in each range, but you get
> the idea -- and this is before we've even started creating virtual
> machines with a different view of physical addresses.
Thank you for taking the time to explain in detail how the memory map and
PCI addressing mechanism works.

> 
> On Tue, Sep 23, 2025 at 06:01:34AM +0000, Kasireddy, Vivek wrote:
> 
> > - The Xe Graphics driver running inside the Linux VM creates a buffer
> > (Gnome Wayland compositor's framebuffer) in the VF's portion (or share)
> > of the VRAM and this buffer is shared with Qemu. Qemu then requests
> > vfio-pci driver to create a dmabuf associated with this buffer.
> 
> That's a bit late. What is EFI supposed to do?
If I understand your question correctly, what happens is the Guest VM's
EFI/BIOS Boot/Kernel messages are all displayed via virtio-vga (which is
included by default?) if it is added to the VM. And, the VF's VRAM does not
get used until Gnome/Mutter compositor starts. So, until this point, all
buffers are created from Guest VM's system memory only.

Thanks,
Vivek

> 
>    Simon

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2025-09-26  6:12 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20250915072428.1712837-1-vivek.kasireddy@intel.com>
2025-09-15  7:21 ` [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device functions of Intel GPUs Vivek Kasireddy
2025-09-15 15:33   ` Logan Gunthorpe
2025-09-16 17:34   ` Bjorn Helgaas
2025-09-16 17:59     ` Jason Gunthorpe
2025-09-16 17:57   ` Jason Gunthorpe
2025-09-18  6:16     ` Kasireddy, Vivek
2025-09-18 12:04       ` Jason Gunthorpe
2025-09-19  6:22         ` Kasireddy, Vivek
2025-09-19 12:29           ` Jason Gunthorpe
2025-09-22  6:59             ` Kasireddy, Vivek
2025-09-22 11:22               ` Christian König
2025-09-22 12:20                 ` Jason Gunthorpe
2025-09-22 12:25                   ` Christian König
2025-09-22 12:29                     ` Jason Gunthorpe
2025-09-22 13:20                       ` Christian König
2025-09-22 13:27                         ` Jason Gunthorpe
2025-09-22 13:57                           ` Christian König
2025-09-22 14:00                             ` Jason Gunthorpe
2025-09-23  5:53                   ` Kasireddy, Vivek
2025-09-23  6:25                     ` Matthew Brost
2025-09-23  6:44                       ` Matthew Brost
2025-09-23  7:52                         ` Christian König
2025-09-23 12:15                           ` Jason Gunthorpe
2025-09-23 12:45                             ` Christian König
2025-09-23 13:12                               ` Jason Gunthorpe
2025-09-23 13:28                                 ` Christian König
2025-09-23 13:38                                   ` Jason Gunthorpe
2025-09-23 13:48                                     ` Christian König
2025-09-23 23:02                                       ` Matthew Brost
2025-09-24  8:29                                         ` Christian König
2025-09-24  6:50                                       ` Kasireddy, Vivek
2025-09-24  7:21                                         ` Christian König
2025-09-25  3:56                                           ` Kasireddy, Vivek
2025-09-25 10:51                                             ` Thomas Hellström
2025-09-25 11:28                                               ` Christian König
2025-09-25 13:11                                                 ` Thomas Hellström
2025-09-25 13:33                                                   ` Jason Gunthorpe
2025-09-25 15:40                                                     ` Thomas Hellström
2025-09-25 15:55                                                       ` Jason Gunthorpe
2025-09-26  6:12                                                 ` Kasireddy, Vivek
2025-09-23 13:36                               ` Christoph Hellwig
2025-09-23  6:01                 ` Kasireddy, Vivek
2025-09-22 12:12               ` Jason Gunthorpe
2025-09-24 16:13           ` Simon Richter
2025-09-24 17:12             ` Jason Gunthorpe
2025-09-25  4:06             ` Kasireddy, Vivek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox