All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: "Christian König" <christian.koenig@amd.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Cornelia Huck" <cohuck@redhat.com>,
	dri-devel@lists.freedesktop.org, kvm@vger.kernel.org,
	linaro-mm-sig@lists.linaro.org, linux-media@vger.kernel.org,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Daniel Vetter" <daniel.vetter@ffwll.ch>,
	"Leon Romanovsky" <leon@kernel.org>,
	linux-rdma@vger.kernel.org, "Maor Gottlieb" <maorg@nvidia.com>,
	"Oded Gabbay" <ogabbay@kernel.org>,
	"Dan Williams" <dan.j.williams@intel.com>
Subject: Re: [PATCH v2 4/4] vfio/pci: Allow MMIO regions to be exported through dma-buf
Date: Wed, 7 Sep 2022 13:12:52 -0300	[thread overview]
Message-ID: <YxjDBOIavc79ZByZ@nvidia.com> (raw)
In-Reply-To: <Yxi5h09JAzIo4Kh8@infradead.org>

On Wed, Sep 07, 2022 at 08:32:23AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
> >  2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
> >     certain special cases.
> 
> Not just certain special cases, but one of the main use cases.
> Basically P2P can happen in two ways:
>
>  a) through a PCIe switch, or
>  b) through connected root ports

Yes, we tested both, both work.

> The open code version here only supports a), only supports it when there
> is no offset between the 'phyiscal' address of the BAR seen PCIe
> endpoint and the Linux way.  x86 usually (always?) doesn't have an
> offset there, but other architectures often do.

The PCI offset is some embedded thing - I've never seen it in a server
platform.

Frankly, it is just bad SOC design and there is good reason why
non-zero needs to be avoided. As soon as you create aliases between
the address spaces you invite trouble. IIRC a SOC I used once put the
memory at 0 -> 4G then put the only PCI aperture at 4g ->
4g+N. However this design requires 64 bit PCI support, which at the
time, the platform didn't have. So they used PCI offset to hackily
alias the aperture over the DDR. I don't remember if they threw out a
bit of DDR to resolve the alias, or if they just didn't support PCI
switches.

In any case, it is a complete mess. You either drastically limit your
BAR size, don't support PCI switches or loose a lot of DDR.

I also seem to remember that iommu and PCI offset don't play nice
together - so for the VFIO use case where the iommu is present I'm
pretty sure we can very safely assume 0 offset. That seems confirmed
by the fact that VFIO has never handled PCI offset in its own P2P path
and P2P works fine in VMs across a wide range of platforms.

That said, I agree we should really have APIs that support this
properly, and dma_map_resource is certainly technically wrong.

So, would you be OK with this series if I try to make a dma_map_p2p()
that resolves the offset issue?

> Last but not least I don't really see how the code would even work
> when an IOMMU is used, as dma_map_resource will return an IOVA that
> is only understood by the IOMMU itself, and not the other endpoint.

I don't understand this.

__iommu_dma_map() will put the given phys into the iommu_domain
associated with 'dev' and return the IOVA it picked.

Here 'dev' is the importing device, it is the device that will issue
the DMA:

+       dma_addr = dma_map_resource(
+               attachment->dev,
+               pci_resource_start(priv->vdev->pdev, priv->index) +
+                       priv->offset,
+               priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);

eg attachment->dev is the PCI device of the RDMA device, not the VFIO
device.

'phys' is the CPU physical of the PCI BAR page, which with 0 PCI
offset is the right thing to program into the IO page table.

> How was this code even tested?

It was tested on a few platforms, like I said above, the cases where
it doesn't work are special, largely embedded, and not anything we
have in our labs - AFAIK.

Jason

WARNING: multiple messages have this Message-ID (diff)
From: Jason Gunthorpe <jgg@nvidia.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: "Leon Romanovsky" <leon@kernel.org>,
	kvm@vger.kernel.org, linux-rdma@vger.kernel.org,
	"Daniel Vetter" <daniel.vetter@ffwll.ch>,
	"Oded Gabbay" <ogabbay@kernel.org>,
	"Cornelia Huck" <cohuck@redhat.com>,
	dri-devel@lists.freedesktop.org,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	linaro-mm-sig@lists.linaro.org,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Maor Gottlieb" <maorg@nvidia.com>,
	"Christian König" <christian.koenig@amd.com>,
	linux-media@vger.kernel.org
Subject: Re: [PATCH v2 4/4] vfio/pci: Allow MMIO regions to be exported through dma-buf
Date: Wed, 7 Sep 2022 13:12:52 -0300	[thread overview]
Message-ID: <YxjDBOIavc79ZByZ@nvidia.com> (raw)
In-Reply-To: <Yxi5h09JAzIo4Kh8@infradead.org>

On Wed, Sep 07, 2022 at 08:32:23AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
> >  2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
> >     certain special cases.
> 
> Not just certain special cases, but one of the main use cases.
> Basically P2P can happen in two ways:
>
>  a) through a PCIe switch, or
>  b) through connected root ports

Yes, we tested both, both work.

> The open code version here only supports a), only supports it when there
> is no offset between the 'phyiscal' address of the BAR seen PCIe
> endpoint and the Linux way.  x86 usually (always?) doesn't have an
> offset there, but other architectures often do.

The PCI offset is some embedded thing - I've never seen it in a server
platform.

Frankly, it is just bad SOC design and there is good reason why
non-zero needs to be avoided. As soon as you create aliases between
the address spaces you invite trouble. IIRC a SOC I used once put the
memory at 0 -> 4G then put the only PCI aperture at 4g ->
4g+N. However this design requires 64 bit PCI support, which at the
time, the platform didn't have. So they used PCI offset to hackily
alias the aperture over the DDR. I don't remember if they threw out a
bit of DDR to resolve the alias, or if they just didn't support PCI
switches.

In any case, it is a complete mess. You either drastically limit your
BAR size, don't support PCI switches or loose a lot of DDR.

I also seem to remember that iommu and PCI offset don't play nice
together - so for the VFIO use case where the iommu is present I'm
pretty sure we can very safely assume 0 offset. That seems confirmed
by the fact that VFIO has never handled PCI offset in its own P2P path
and P2P works fine in VMs across a wide range of platforms.

That said, I agree we should really have APIs that support this
properly, and dma_map_resource is certainly technically wrong.

So, would you be OK with this series if I try to make a dma_map_p2p()
that resolves the offset issue?

> Last but not least I don't really see how the code would even work
> when an IOMMU is used, as dma_map_resource will return an IOVA that
> is only understood by the IOMMU itself, and not the other endpoint.

I don't understand this.

__iommu_dma_map() will put the given phys into the iommu_domain
associated with 'dev' and return the IOVA it picked.

Here 'dev' is the importing device, it is the device that will issue
the DMA:

+       dma_addr = dma_map_resource(
+               attachment->dev,
+               pci_resource_start(priv->vdev->pdev, priv->index) +
+                       priv->offset,
+               priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);

eg attachment->dev is the PCI device of the RDMA device, not the VFIO
device.

'phys' is the CPU physical of the PCI BAR page, which with 0 PCI
offset is the right thing to program into the IO page table.

> How was this code even tested?

It was tested on a few platforms, like I said above, the cases where
it doesn't work are special, largely embedded, and not anything we
have in our labs - AFAIK.

Jason

  reply	other threads:[~2022-09-07 16:13 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-31 23:12 [PATCH v2 0/4] Allow MMIO regions to be exported through dma-buf Jason Gunthorpe
2022-08-31 23:12 ` Jason Gunthorpe
2022-08-31 23:12 ` [PATCH v2 1/4] dma-buf: Add dma_buf_try_get() Jason Gunthorpe
2022-08-31 23:12   ` Jason Gunthorpe
2022-09-01  7:55   ` Christian König
2022-09-01  7:55     ` Christian König
2022-09-06 16:44     ` Jason Gunthorpe
2022-09-06 16:44       ` Jason Gunthorpe
2022-09-06 17:52       ` Christian König
2022-09-06 17:52         ` Christian König
2022-08-31 23:12 ` [PATCH v2 2/4] vfio: Add vfio_device_get() Jason Gunthorpe
2022-08-31 23:12   ` Jason Gunthorpe
2022-08-31 23:12 ` [PATCH v2 3/4] vfio_pci: Do not open code pci_try_reset_function() Jason Gunthorpe
2022-08-31 23:12   ` Jason Gunthorpe
2022-08-31 23:12 ` [PATCH v2 4/4] vfio/pci: Allow MMIO regions to be exported through dma-buf Jason Gunthorpe
2022-08-31 23:12   ` Jason Gunthorpe
2022-09-06  9:51   ` Christoph Hellwig
2022-09-06 10:38     ` Christian König
2022-09-06 10:38       ` Christian König
2022-09-06 11:48       ` Jason Gunthorpe
2022-09-06 11:48         ` Jason Gunthorpe
2022-09-06 12:34         ` Oded Gabbay
2022-09-06 12:34           ` Oded Gabbay
2022-09-06 17:59           ` Jason Gunthorpe
2022-09-06 17:59             ` Jason Gunthorpe
2022-09-06 19:44             ` Oded Gabbay
2022-09-06 19:44               ` Oded Gabbay
2022-09-07 12:07               ` Christoph Hellwig
2022-09-07 12:05         ` Christoph Hellwig
2022-09-07 12:33           ` Jason Gunthorpe
2022-09-07 12:33             ` Jason Gunthorpe
2022-09-07 14:29             ` Christoph Hellwig
2022-09-07 14:46               ` Oded Gabbay
2022-09-07 14:46                 ` Oded Gabbay
2022-09-07 15:23               ` Jason Gunthorpe
2022-09-07 15:23                 ` Jason Gunthorpe
2022-09-07 15:32                 ` Christoph Hellwig
2022-09-07 16:12                   ` Jason Gunthorpe [this message]
2022-09-07 16:12                     ` Jason Gunthorpe
2022-09-09 13:24                     ` Christoph Hellwig
2022-09-09 14:09                       ` Jason Gunthorpe
2022-09-09 14:09                         ` Jason Gunthorpe
2022-09-07 16:31                 ` Robin Murphy
2022-09-07 16:31                   ` Robin Murphy
2022-09-07 16:47                   ` Jason Gunthorpe
2022-09-07 16:47                     ` Jason Gunthorpe
2022-09-07 17:03               ` Dan Williams
2022-09-07 17:03                 ` Dan Williams
2022-09-07 15:01             ` Robin Murphy
2022-09-07 15:01               ` Robin Murphy
2022-09-07 12:03       ` Christoph Hellwig
2022-09-07 15:08         ` Christian König
2022-09-07 15:08           ` Christian König
2022-09-07 15:23           ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YxjDBOIavc79ZByZ@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=christian.koenig@amd.com \
    --cc=cohuck@redhat.com \
    --cc=dan.j.williams@intel.com \
    --cc=daniel.vetter@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hch@infradead.org \
    --cc=kvm@vger.kernel.org \
    --cc=leon@kernel.org \
    --cc=linaro-mm-sig@lists.linaro.org \
    --cc=linux-media@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=maorg@nvidia.com \
    --cc=ogabbay@kernel.org \
    --cc=sumit.semwal@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.