From: Christoph Hellwig <hch@lst.de>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: "Christoph Hellwig" <hch@lst.de>,
"Leon Romanovsky" <leon@kernel.org>,
"Robin Murphy" <robin.murphy@arm.com>,
"Marek Szyprowski" <m.szyprowski@samsung.com>,
"Joerg Roedel" <joro@8bytes.org>, "Will Deacon" <will@kernel.org>,
"Chaitanya Kulkarni" <chaitanyak@nvidia.com>,
"Jonathan Corbet" <corbet@lwn.net>,
"Jens Axboe" <axboe@kernel.dk>, "Keith Busch" <kbusch@kernel.org>,
"Sagi Grimberg" <sagi@grimberg.me>,
"Yishai Hadas" <yishaih@nvidia.com>,
"Shameer Kolothum" <shameerali.kolothum.thodi@huawei.com>,
"Kevin Tian" <kevin.tian@intel.com>,
"Alex Williamson" <alex.williamson@redhat.com>,
"Jérôme Glisse" <jglisse@redhat.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
iommu@lists.linux.dev, linux-nvme@lists.infradead.org,
kvm@vger.kernel.org, linux-mm@kvack.org,
"Bart Van Assche" <bvanassche@acm.org>,
"Damien Le Moal" <damien.lemoal@opensource.wdc.com>,
"Amir Goldstein" <amir73il@gmail.com>,
"josef@toxicpanda.com" <josef@toxicpanda.com>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
"daniel@iogearbox.net" <daniel@iogearbox.net>,
"Dan Williams" <dan.j.williams@intel.com>,
"jack@suse.com" <jack@suse.com>,
"Zhu Yanjun" <zyjzyj2000@gmail.com>
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Date: Thu, 7 Mar 2024 16:05:05 +0100
Message-ID: <20240307150505.GA28978@lst.de>
In-Reply-To: <20240307000036.GP9225@ziepe.ca>
On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> >
> > I don't think you can do without dma_addr_t storage. In most cases
> > you can just store the dma_addr_t in the LE/BE encoded hardware
> > SGL, so no extra storage should be needed though.
>
> RDMA (and often DRM too) generally doesn't work like that; the driver
> copies the page table into the device, and then the only reason to
> keep dma_addr_t storage around is to pass it to the DMA unmap API.
> Optionally eliminating long-term dma_addr_t storage would be a
> worthwhile memory saving for large, long-lived user space memory
> registrations.
It's just kinda hard to do.  For aligned IOMMU mappings you'd only
have one dma_addr_t mapping (or maybe a few if P2P regions are
involved), so this probably doesn't matter.  For direct mappings
you'd have a few, but maybe the better answer is to use THP
more aggressively and reduce the number of segments.
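As a concrete illustration of storing the dma_addr_t in the hardware
SGL (a minimal sketch; the hw_sgl_entry layout is hypothetical,
loosely modeled on NVMe-style little-endian descriptors):

	/* hypothetical LE hardware SGL entry; real layouts vary */
	struct hw_sgl_entry {
		__le64	addr;
		__le32	len;
		__le32	flags;
	};

	static void fill_hw_sgl(struct hw_sgl_entry *sge,
				dma_addr_t addr, u32 len)
	{
		/* the hardware SGL itself is the dma_addr_t storage */
		sge->addr  = cpu_to_le64(addr);
		sge->len   = cpu_to_le32(len);
		sge->flags = 0;
	}

With an aligned IOMMU mapping the driver then only has to remember
the one (IOVA, length) pair to hand back at unmap time.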
> I wrote the list from a single IO operation's perspective, so all but
> case 5 need to store only a single IOVA range, which can live in some
> simple non-dynamic memory along with whatever HW SGLs etc. are needed.
>
> The point of case 5 being different is that the driver has to provide
> a dynamically sized list of dma_addr_t's as storage until unmap.  Case
> 5 is the only one that requires that full list.
No, all cases need to store one or more ranges.
> > > So are you thinking something more like a driver flow of:
> > >
> > >    .. extent IO and get # aligned pages and know if there is P2P ..
> > >    dma_init_io(state, num_pages, p2p_flag)
> > >    if (dma_io_single_range(state)) {
> > >         // #2, #4
> > >         for each io()
> > >              dma_link_aligned_pages(state, io range)
> > >         hw_sgl = (state->iova, state->len)
> > >    } else {
> >
> > I think what you have as dma_io_single_range should come before
> > the dma_init_io.  If we know we can't coalesce, it really just is a
> > dma_map_{single,page,bvec} loop, no need for any extra state.
>
> I imagine dma_io_single_range() would just check a flag in state.
>
> I still want to call dma_init_io() for the non-coalescing cases
> because all the flows, regardless of composition, should be about as
> fast as dma_map_sg is today.
If all flows include multiple non-coalesced regions, that just makes
things very complicated, and that's exactly what I'd want to avoid.
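The non-coalesced path really is just today's API in a loop; a rough
sketch, reusing the hypothetical fill_hw_sgl() from above and leaving
out the error unwinding:

	int i;

	for (i = 0; i < nr_pages; i++) {
		dma_addr_t addr = dma_map_page(dev, pages[i], 0,
					       PAGE_SIZE, DMA_TO_DEVICE);

		/* one dma_addr_t per page, no state shared between them */
		if (dma_mapping_error(dev, addr))
			goto unwind;
		fill_hw_sgl(&sgl[i], addr, PAGE_SIZE);
	}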
> That means we need to always pre-allocate the IOVA in any case where
> the IOMMU might be active - even on a non-coalescing flow.
>
> IOW, dma_init_io() always pre-allocates the IOVA if the IOMMU is
> going to be used, so that the non-coalescing side doesn't have to call
> today's dma_map_page() in a loop and pay the overhead of N separate
> IOVA allocations.
>
> In large part this is for RDMA, where a single P2P page in a large
> multi-gigabyte user memory registration shouldn't drastically harm
> registration performance by falling back to dma_map_page(), and an
> IOVA allocation, on a 4k page-by-page basis.
But that P2P page needs to be handled very differently, as with it
we can't actually use a single IOVA range.  So I'm not sure how that
is even supposed to work.  If you have:
+-------+-----+-------+
| local | P2P | local |
+-------+-----+-------+
you need at least 3 HW SGL entries, as the IOVA won't be contiguous.
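Which is also why the hardware SGL writer has to break an entry at
every discontinuity anyway; a sketch of a hypothetical writer that
merges ranges only while the DMA addresses stay contiguous:

	static int write_hw_sgl(struct hw_sgl_entry *sgl, int nents,
				const dma_addr_t *addr, const u32 *len,
				int nranges)
	{
		int i, e = -1;

		for (i = 0; i < nranges; i++) {
			/* contiguous with the previous entry: merge */
			if (e >= 0 && addr[i] == le64_to_cpu(sgl[e].addr) +
					le32_to_cpu(sgl[e].len)) {
				sgl[e].len =
					cpu_to_le32(le32_to_cpu(sgl[e].len) +
						    len[i]);
				continue;
			}
			/* discontiguous, e.g. a local/P2P transition */
			if (++e >= nents)
				return -ENOMEM;
			fill_hw_sgl(&sgl[e], addr[i], len[i]);
		}
		return e + 1;
	}

For the layout above this produces exactly the 3 entries, no matter
how the individual ranges were mapped.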
> The other thing that got hand-waved here is: how does dma_init_io()
> know which of the 6 cases we are looking at?  I imagine we probably
> want to do something like:
>
>    struct dma_io_summarize summary = {};
>    for each io()
>         dma_io_summarize_range(&summary, io range)
>    dma_init_io(dev, &state, &summary);
>    if (state->single_range) {
>    } else {
>    }
>    dma_io_done_mapping(&state); <-- flush IOTLB once
That's why I really just want 2 cases.  If the caller guarantees the
range is coalescable and there is an IOMMU, use the iommu-API-like
API; else just iterate over map_single/page.
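In driver terms the two cases would look something like this (a rough
sketch with error handling left out; dma_alloc_iova() and
dma_link_range() are the interfaces proposed in this series, with
signatures guessed here, and can_coalesce() / for_each_range() are
made-up glue):

	if (use_iommu && can_coalesce(req)) {
		/* one IOVA allocation, one IOTLB flush, one HW SGL entry */
		dma_addr_t iova = dma_alloc_iova(dev, total_len);
		dma_addr_t next = iova;

		for_each_range(req, r) {
			dma_link_range(dev, next, r->page, r->len);
			next += r->len;
		}
		fill_hw_sgl(&sgl[0], iova, total_len);
	} else {
		/* today's behavior: one mapping and SGL entry per range */
		int i = 0;

		for_each_range(req, r)
			fill_hw_sgl(&sgl[i++],
				    dma_map_page(dev, r->page, 0, r->len,
						 DMA_TO_DEVICE),
				    r->len);
	}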
> Enhancing the single-SGL case is not a big change, I think.  It does
> seem simpler for the driver not to have to coalesce SGLs to detect
> the single-SGL fast path.
>
> > > This is not quite what you said: we split the driver flow based
> > > on needing 1 HW SGL vs. needing many HW SGLs.
> >
> > That's at least what I intended to say, and I'm a little curious
> > as to how it came across.
>
> Ok, I was reading the discussion as being more about alignment than
> about a single HW SGL.  I think you meant alignment as implying
> coalescing behavior, implying a single HW SGL.
Yes.