From: Jason Gunthorpe <jgg@ziepe.ca>
To: Christoph Hellwig <hch@lst.de>
Cc: "Leon Romanovsky" <leon@kernel.org>,
"Robin Murphy" <robin.murphy@arm.com>,
"Marek Szyprowski" <m.szyprowski@samsung.com>,
"Joerg Roedel" <joro@8bytes.org>, "Will Deacon" <will@kernel.org>,
"Chaitanya Kulkarni" <chaitanyak@nvidia.com>,
"Jonathan Corbet" <corbet@lwn.net>,
"Jens Axboe" <axboe@kernel.dk>, "Keith Busch" <kbusch@kernel.org>,
"Sagi Grimberg" <sagi@grimberg.me>,
"Yishai Hadas" <yishaih@nvidia.com>,
"Shameer Kolothum" <shameerali.kolothum.thodi@huawei.com>,
"Kevin Tian" <kevin.tian@intel.com>,
"Alex Williamson" <alex.williamson@redhat.com>,
"Jérôme Glisse" <jglisse@redhat.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
iommu@lists.linux.dev, linux-nvme@lists.infradead.org,
kvm@vger.kernel.org, linux-mm@kvack.org,
"Bart Van Assche" <bvanassche@acm.org>,
"Damien Le Moal" <damien.lemoal@opensource.wdc.com>,
"Amir Goldstein" <amir73il@gmail.com>,
"josef@toxicpanda.com" <josef@toxicpanda.com>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
"daniel@iogearbox.net" <daniel@iogearbox.net>,
"Dan Williams" <dan.j.williams@intel.com>,
"jack@suse.com" <jack@suse.com>,
"Zhu Yanjun" <zyjzyj2000@gmail.com>
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Date: Fri, 8 Mar 2024 16:23:42 -0400
Message-ID: <20240308202342.GZ9225@ziepe.ca>
In-Reply-To: <20240308164920.GA17991@lst.de>
On Fri, Mar 08, 2024 at 05:49:20PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 07, 2024 at 05:01:16PM -0400, Jason Gunthorpe wrote:
> > >
> > > It's just kinda hard to do. For aligned IOMMU mapping you'd only
> > > have one dma_addr_t mappings (or maybe a few if P2P regions are
> > > involved), so this probably doesn't matter. For direct mappings
> > > you'd have a few, but maybe the better answer is to use THP
> > > more aggressively and reduce the number of segments.
> >
> > Right, those things have all been done. 100GB of huge pages is still
> > using a fair amount of memory for storing dma_addr_t's.
> >
> > It is hard to do perfectly, but I think it is not so bad if we focus
> > on the direct only case and simple systems that can exclude swiotlb
> > early on.
>
> Even with direct mappings only we still need to take care of
> cache synchronization.

Yes, we still have to unmap, but the unmap for cache synchronization
doesn't need the dma_addr_t to flush the CPU cache.

> > > If all flows includes multiple non-coalesced regions that just makes
> > > things very complicated, and that's exactly what I'd want to avoid.
> >
> > I don't see how to avoid it unless we say RDMA shouldn't use this API,
> > which is kind of the whole point from my perspective..
>
> The DMA API callers really need to know what is P2P or not for
> various reasons. And they should generally have that information
> available, either from pin_user_pages that needs to special case
> it or from the in-kernel I/O submitter that build it from P2P and
> normal memory.

I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
shoves the resulting page list into a scattertable. It never checks
if any returned page is P2P - it has no reason to care. dma_map_sg()
does all the work.

That is the kind of abstraction I am coming to this problem with.

You are looking at BIO where you already needed to split things up for
other reasons, but I think that is a uniquely block thing that will
not be shared in other subsystems.

> > If you don't preserve that then we are calling, 4k at a time, a
> > dma_map_page() which is not anywhere close to the same outcome as what
> > dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
> > and we call into the IOVA allocator a huge number of times.
>
> Again, your callers must know what is a P2P region and what is not.

I don't see this at all. We don't do this today in RDMA. There is no
"P2P region".

> > > That's why I really just want 2 cases. If the caller guarantees the
> > > range is coalescable and there is an IOMMU use the iommu-API like
> > > API, else just iter over map_single/page.
> >
> > But how does the caller even know if it is coalescable? Other than the
> > trivial case of a single CPU range, that is a complicated detail based
> > on what pages are inside the range combined with the capability of the
> > device doing DMA. I don't see a simple way for the caller to figure
> > this out. You need to sweep every page and collect some information on
> > it. The above is to abstract that detail.
>
> dma_get_merge_boundary already provides this information in terms
> of the device capabilities. And given that the callers knows what
> is P2P and what is not we have all the information that is needed.

Encrypted memory too.

RDMA also doesn't call dma_get_merge_boundary(). It doesn't keep track
of P2P regions. It doesn't break out encrypted memory. It has no
purpose to do any of those things.

You fundamentally cannot subdivide a memory registration.

So we could artificially introduce the concept of limited coalescing
into RDMA, dmabuf and others just to drive this new API - but really
that feels much much worse than just making the DMA API still able to
do IOMMU coalescing in more cases.

Even if we did that, it will still be less efficient than today where
we just call dma_map_sg() on the jumble of pages.
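
To make the efficiency point concrete, here is a rough userspace toy
model, not kernel code. Everything in it (toy_iova_alloc(), the guard
gap between allocations, the struct names) is a hypothetical stand-in
invented for illustration; it assumes an idealized IOMMU that can place
any page-aligned set of pages at contiguous IOVA. It only counts how
many allocator calls and DMA segments result from mapping one page at a
time versus mapping the whole jumble in one go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_IOVA_BASE 0x100000u

/* One toy IOMMU entry: a device-visible address backed by a CPU page. */
struct toy_mapping {
	uint64_t iova;
	uint64_t phys;
};

/* What a mapping strategy cost us. */
struct toy_map_result {
	size_t iova_allocs;   /* calls into the (toy) IOVA allocator */
	size_t dma_segments;  /* contiguous IOVA runs the device must handle */
};

/*
 * Hypothetical bump allocator. The guard gap is an assumption of this
 * model: it means two separate allocations are never adjacent.
 */
static uint64_t toy_iova_alloc(uint64_t *next, size_t len)
{
	uint64_t iova = *next;

	*next += len + TOY_PAGE_SIZE;
	return iova;
}

/* Count runs of IOVA-contiguous pages. */
static size_t toy_count_segments(const struct toy_mapping *m, size_t n)
{
	size_t segs = n ? 1 : 0;

	for (size_t i = 1; i < n; i++)
		if (m[i].iova != m[i - 1].iova + TOY_PAGE_SIZE)
			segs++;
	return segs;
}

/* Page-at-a-time: one allocator call per page. */
static struct toy_map_result toy_map_per_page(const uint64_t *phys, size_t n,
					      struct toy_mapping *out)
{
	struct toy_map_result r = { 0, 0 };
	uint64_t next = TOY_IOVA_BASE;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		out[i].iova = toy_iova_alloc(&next, TOY_PAGE_SIZE);
		out[i].phys = phys[i];
		r.iova_allocs++;
	}
	r.dma_segments = toy_count_segments(out, n);
	return r;
}

/* Whole list at once: one allocation, pages laid out back to back. */
static struct toy_map_result toy_map_batched(const uint64_t *phys, size_t n,
					     struct toy_mapping *out)
{
	struct toy_map_result r = { 1, 0 };
	uint64_t next = TOY_IOVA_BASE;
	uint64_t base;

	assert(n > 0);
	base = toy_iova_alloc(&next, n * TOY_PAGE_SIZE);
	for (size_t i = 0; i < n; i++) {
		out[i].iova = base + (uint64_t)i * TOY_PAGE_SIZE;
		out[i].phys = phys[i];
	}
	r.dma_segments = toy_count_segments(out, n);
	return r;
}
```

In this model, four physically scattered pages cost four allocator
calls and four segments when mapped one by one, and one call and one
segment when mapped together. The real allocator and IOMMU drivers are
of course more involved than this.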
Jason