All of lore.kernel.org
 help / color / mirror / Atom feed
From: Christoph Hellwig <hch@lst.de>
To: Robin Murphy <robin.murphy@arm.com>
Cc: "Leon Romanovsky" <leon@kernel.org>,
	"Jens Axboe" <axboe@kernel.dk>, "Jason Gunthorpe" <jgg@ziepe.ca>,
	"Joerg Roedel" <joro@8bytes.org>, "Will Deacon" <will@kernel.org>,
	"Christoph Hellwig" <hch@lst.de>,
	"Sagi Grimberg" <sagi@grimberg.me>,
	"Keith Busch" <kbusch@kernel.org>,
	"Bjorn Helgaas" <bhelgaas@google.com>,
	"Logan Gunthorpe" <logang@deltatee.com>,
	"Yishai Hadas" <yishaih@nvidia.com>,
	"Shameer Kolothum" <shameerali.kolothum.thodi@huawei.com>,
	"Kevin Tian" <kevin.tian@intel.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Marek Szyprowski" <m.szyprowski@samsung.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Jonathan Corbet" <corbet@lwn.net>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
	iommu@lists.linux.dev, linux-nvme@lists.infradead.org,
	linux-pci@vger.kernel.org, kvm@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
Date: Mon, 4 Nov 2024 10:58:31 +0100	[thread overview]
Message-ID: <20241104095831.GA28751@lst.de> (raw)
In-Reply-To: <3567312e-5942-4037-93dc-587f25f0778c@arm.com>

On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> The hilarious amount of work that iommu_dma_map_sg() does is pretty much 
> entirely for the benefit of v4l2 and dma-buf importers who *depend* on 
> being able to linearise a scatterlist in DMA address space. TBH I doubt 
> there are many actual scatter-gather-capable devices with significant 
> enough limitations to meaningfully benefit from DMA segment combining these 
> days - I've often thought that by now it might be a good idea to turn that 
> behaviour off by default and add an attribute for callers to explicitly 
> request it.

Even when devices are not limited they often perform significantly better
when IOVA space is not completely fragmented.  While the dma_map_sg code
is a bit gross due to the fact that it has to deal with unaligned segments,
the coalescing itself often is a big win.

Note that dma_map_sg also has two other very useful features:  batching
of the iotlb flushing, and support for P2P, which to be efficient also
requires batching the lookups.

>> This uniqueness has been a long standing pain point as the scatterlist API
>> is mandatory, but expensive to use.
>
> Huh? When and where has anything ever called it mandatory? Nobody's getting 
> sent to DMA jail for open-coding:

You don't get sent to jail.  But you do not get batched iotlb sync, you
don't get properly working P2P, and you don't get IOVA coalescing.

>> Several approaches have been explored to expand the DMA API with additional
>> scatterlist-like structures (BIO, rlist), instead split up the DMA API
>> to allow callers to bring their own data structure.
>
> And this line of reasoning is still "2 + 2 = Thursday" - what is to say 
> those two notions in any way related? We literally already have one generic 
> DMA operation which doesn't operate on struct page, yet needed nothing 
> "split up" to be possible.

Yeah, I don't really get the struct page argument.  In fact if we look
at the nitty-gritty details of dma_map_page it doesn't really need a
page at all.  I've been looking at cleaning some of this up and providing
a dma_map_phys/paddr which would be quite handy in a few places.  Note
because we don't have a struct page for the memory, but because converting
to/from it all the time is not very efficient.

>>   2. VFIO PCI live migration code is building a very large "page list"
>>      for the device. Instead of allocating a scatter list entry per allocated
>>      page it can just allocate an array of 'struct page *', saving a large
>>      amount of memory.
>
> VFIO already assumes a coherent device with (realistically) an IOMMU which 
> it explicitly manages - why is it even pretending to need a generic DMA 
> API?

AFAIK that does isn't really vfio as we know it but the control device
for live migration.  But Leon or Jason might fill in more.

The point is that quite a few devices have these page list based APIs
(RDMA where mlx5 comes from, NVMe with PRPs, AHCI, GPUs).

>
>>   3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>>      list without having to allocate then populate an intermediate SG table.
>
> As above, given that a bio_vec still deals in struct pages, that could 
> seemingly already be done by just mapping the pages, so how is it proving 
> any benefit of a fragile new interface?

Because we only need to preallocate the tiny constant sized dma_iova_state
as part of the request instead of an additional scatterlist that requires
sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
per segment, including a memory allocation per I/O for that.

> My big concern here is that a thin and vaguely-defined wrapper around the 
> IOMMU API is itself a step which smells strongly of "abuse and design 
> mistake", given that the basic notion of allocating DMA addresses in 
> advance clearly cannot generalise. Thus it really demands some considered 
> justification beyond "We must do something; This is something; Therefore we 
> must do this." to be convincing.

At least for the block code we have a nice little core wrapper that is
very easy to use, and provides a great reduction of memory use and
allocations.  The HMM use case I'll let others talk about.


  reply	other threads:[~2024-11-04  9:58 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 01/17] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 02/17] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 03/17] iommu: generalize the batched sync after map interface Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used Leon Romanovsky
2024-11-10 15:09   ` Zhu Yanjun
2024-11-10 15:19     ` Leon Romanovsky
2024-11-11  6:39     ` Christoph Hellwig
2024-11-11  7:19       ` Greg Sword
2024-10-30 15:12 ` [PATCH v1 05/17] dma: Provide an interface to allow allocate IOVA Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 06/17] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
2024-10-31 21:18   ` Robin Murphy
2024-11-04  9:10     ` Christoph Hellwig
2024-11-04 12:19       ` Jason Gunthorpe
2024-11-04 12:53         ` Christoph Hellwig
2024-11-07 14:50           ` Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 08/17] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
2024-10-31 21:18   ` Robin Murphy
2024-11-01 11:06     ` Leon Romanovsky
2024-11-04  9:15     ` Christoph Hellwig
2024-10-30 15:12 ` [PATCH v1 09/17] docs: core-api: document the IOVA-based API Leon Romanovsky
2024-10-31  1:41   ` Randy Dunlap
2024-10-31  7:59     ` Leon Romanovsky
2024-11-08 19:34   ` Jonathan Corbet
2024-11-08 20:03     ` Leon Romanovsky
2024-11-08 20:13       ` Jonathan Corbet
2024-11-08 20:27         ` Leon Romanovsky
2024-11-10 10:41           ` Leon Romanovsky
2024-11-11  6:38             ` Christoph Hellwig
2024-11-11  6:43               ` anish kumar
2024-11-11 14:59                 ` Jonathan Corbet
2024-10-30 15:12 ` [PATCH v1 10/17] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 11/17] mm/hmm: provide generic DMA managing logic Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 12/17] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 13/17] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage Leon Romanovsky
2024-10-30 15:13 ` [PATCH v1 14/17] RDMA/umem: Separate implicit ODP initialization from explicit ODP Leon Romanovsky
2024-10-30 15:13 ` [PATCH v1 15/17] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
2024-10-30 15:13 ` [PATCH v1 16/17] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
2024-10-30 15:13 ` [PATCH v1 17/17] vfio/mlx5: Convert vfio to use DMA link API Leon Romanovsky
2024-10-31  1:44 ` [PATCH v1 00/17] Provide a new two step DMA mapping API Jens Axboe
2024-10-31  8:34   ` Christoph Hellwig
2024-10-31  9:05     ` Leon Romanovsky
2024-10-31  9:21       ` Christoph Hellwig
2024-10-31  9:37         ` Leon Romanovsky
2024-10-31 17:43           ` Jens Axboe
2024-10-31 20:43             ` Leon Romanovsky
2024-10-31 17:42     ` Jens Axboe
2024-10-31 21:17 ` Robin Murphy
2024-11-04  9:58   ` Christoph Hellwig [this message]
2024-11-04 11:39     ` Leon Romanovsky
2024-11-05 19:53     ` Jason Gunthorpe
2024-11-07  8:32       ` Christoph Hellwig
2024-11-07 13:28         ` Jason Gunthorpe
2024-11-07 13:50           ` Christoph Hellwig
2024-11-08 15:02             ` Jason Gunthorpe
2024-11-08 15:05               ` Christoph Hellwig
2024-11-08 15:25                 ` Jason Gunthorpe
2024-11-08 15:29                   ` Christoph Hellwig
2024-11-08 15:38                     ` Jason Gunthorpe
2024-11-12  6:01                       ` Christoph Hellwig
2024-11-13 18:41                         ` Jason Gunthorpe
2024-11-05 18:51 ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241104095831.GA28751@lst.de \
    --to=hch@lst.de \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=bhelgaas@google.com \
    --cc=corbet@lwn.net \
    --cc=iommu@lists.linux.dev \
    --cc=jgg@ziepe.ca \
    --cc=jglisse@redhat.com \
    --cc=joro@8bytes.org \
    --cc=kbusch@kernel.org \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=logang@deltatee.com \
    --cc=m.szyprowski@samsung.com \
    --cc=robin.murphy@arm.com \
    --cc=sagi@grimberg.me \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=will@kernel.org \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.