From: Jason Gunthorpe <jgg@nvidia.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
Joao Martins <joao.m.martins@oracle.com>,
John Hubbard <jhubbard@nvidia.com>,
Logan Gunthorpe <logang@deltatee.com>,
Ming Lei <ming.lei@redhat.com>,
linux-block@vger.kernel.org, netdev@vger.kernel.org,
linux-mm@kvack.org, linux-rdma@vger.kernel.org,
dri-devel@lists.freedesktop.org, nvdimm@lists.linux.dev
Subject: Re: Phyr Starter
Date: Tue, 11 Jan 2022 11:01:42 -0400 [thread overview]
Message-ID: <20220111150142.GL2328285@nvidia.com> (raw)
In-Reply-To: <Yd0IeK5s/E0fuWqn@casper.infradead.org>
On Tue, Jan 11, 2022 at 04:32:56AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> >
> > > Finally, it may be possible to stop using scatterlist to describe the
> > > input to the DMA-mapping operation. We may be able to get struct
> > > scatterlist down to just dma_address and dma_length, with chaining
> > > handled through an enclosing struct.
> >
> > Can you talk about this some more? IMHO one of the key properties of
> > the scatterlist is that it can hold huge amounts of pages without
> > having to do any kind of special allocation due to the chaining.
> >
> > The same will be true of the phyr idea right?
>
> My thinking is that we'd pass a relatively small array of phyr (maybe 16
> entries) to get_user_phyr(). If that turned out not to be big enough,
> then we have two options; one is to map those 16 ranges with sg and use
> the sg chaining functionality before throwing away the phyr and calling
> get_user_phyr() again.
Then we are we using get_user_phyr() at all if we are just storing it
in a sg?
Also 16 entries is way to small, it should be at least a whole PMD
worth so we don't have to relock the PMD level each iteration.
I would like to see a flow more like:
cpu_phyr_list = get_user_phyr(uptr, 1G);
dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
[..]
dma_unmap_phyr(device, dma_phyr_list);
unpin_drity_free(cpu_phy_list);
Where dma_map_phyr() can build a temporary SGL for old iommu drivers
compatability. iommu drivers would want to implement natively, of
course.
ie no loops in drivers.
> The question is whether this is the right kind of optimisation to be
> doing. I hear you that we want a dense format, but it's questionable
> whether the kind of thing you're suggesting is actually denser than this
> scheme. For example, if we have 1GB pages and userspace happens to have
> allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
> as a single phyr. A power-of-two scheme would have us use four entries
> (3, 4-7, 8-9, 10).
That is not quite what I had in mind..
struct phyr_list {
unsigned int first_page_offset_bytes;
size_t total_length_bytes;
phys_addr_t min_alignment;
struct packed_phyr *list_of_pages;
};
Where each 'packed_phyr' is an aligned page of some kind. The packing
has to be able to represent any number of pfns, so we have four major
cases:
- 4k pfns (use 8 bytes)
- Natural order pfn (use 8 bytes)
- 4k aligned pfns, arbitary number (use 12 bytes)
- <4k aligned, arbitary length (use 16 bytes?)
In all cases the interior pages are fully used, only the first and
last page is sliced based on the two parameters in the phyr_list.
The first_page_offset_bytes/total_length_bytes mean we don't need to
use the inefficient coding for many common cases, just stick to the 4k
coding and slice the first/last page down.
The last case is, perhaps, a possible route to completely replace
scatterlist. Few places need true byte granularity for interior pages,
so we can invent some coding to say 'this is 8 byte aligned, and n
bytes long' that only fits < 4k or something. Exceptional cases can
then still work. I'm not sure what block needs here - is it just 512?
Basically think of list_of_pages as showing a contiguous list of at
least min_aligned pages and first_page_offset_bytes/total_length_bytes
taking a byte granular slice out of that logical range.
From a HW perspective I see two basic modalities:
- Streaming HW, which read/writes in a single pass (think
NVMe/storage/network). Usually takes a list of dma_addr_t and
length that HW just walks over. Rarely cares about things like page
boundaries. Optimization goal is to minimize list length. In this
case we map each packed_phyr into a HW SGL
- Random Access HW, which is randomly touching memory (think RDMA,
VFIO, DRM, IOMMU). Usually stores either a linear list of same-size
dma_addr_t pages, or a radix tree page table of dma_addr_t.
Needs to have a minimum alignment of each chunk (usually 4k) to
represent it. Optimization goal is to have maximum page size. In
this case we use min_alignment to size the HW array and decode the
packed_phyrs into individual pages.
> Using a (dma_addr, size_t) tuple makes coalescing adjacent pages very
> cheap.
With the above this still works, the very last entry in list_of_pages
would be the 12 byte pfn type and when we start a new page the logic
would then optimize it down to 8 bytes, if possible. At that point we
know we are not going to change it:
- An interior page that is up perfectly aligned is represented as a
natural order
- A starting page that ends on perfect alignment is widened to
natural order and first_page_offset_bytes is corrected
- An ending page that starts on perfect alignment is widened to
natural order and total_length_bytes is set
(though no harm in keeping the 12 byte represetation I suppose)
The main detail is to make the extra 4 bytes needed to store the
arbtiary pfn counts optional so when we don't need it, it isn't there.
> > VFIO would like this structure as well as it currently is a very
> > inefficient page at a time loop when it iommu maps things.
>
> I agree that you need these things. I think I'll run into trouble
> if I try to do them for you ... so I'm going to stop after doing the
> top end (pinning pages and getting them into an sg list) and let
> people who know that area better than I do work on that.
I agree, that is why I was asking for the datastructure 'phyr_list',
with chaining and so on.
I would imagine a few steps to this process:
1) 'phyr_list' datastructure, with chaining, pre-allocation, etc
2) Wrapper around existing gup to get a phyr_list for user VA
3) Compat 'dma_map_phyr()' that coverts a phyr_list to a sgl and back
(However, with full performance for iommu passthrough)
4) Patches changing RDMA/VFIO/DRM to this API
5) Patches optimizing get_user_phyr()
6) Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver
Obviously not all done by you.. I'm happy to help and take a swing at
the RDMA and VFIO parts.
I feel like we can go ahead with RDMA so long as the passthrough IOMMU
case is 100% efficient. VFIO is already maximually inefficient here,
so no worry there already. If VFIO and RDMA consume these APIs then
the server IOMMU driver owners should have a strong motivation to
implement.
My feeling is that at the core this project is about making a better
datastructure than scattterlist that can mostly replace it, then going
around the kernel and converting scatterlist users.
Given all the usage considerations I think it is an interesting
datastructure.
Jason
WARNING: multiple messages have this Message-ID (diff)
From: Jason Gunthorpe <jgg@nvidia.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: nvdimm@lists.linux.dev, linux-rdma@vger.kernel.org,
John Hubbard <jhubbard@nvidia.com>,
linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
Ming Lei <ming.lei@redhat.com>,
linux-block@vger.kernel.org, linux-mm@kvack.org,
netdev@vger.kernel.org, Joao Martins <joao.m.martins@oracle.com>,
Logan Gunthorpe <logang@deltatee.com>,
Christoph Hellwig <hch@lst.de>
Subject: Re: Phyr Starter
Date: Tue, 11 Jan 2022 11:01:42 -0400 [thread overview]
Message-ID: <20220111150142.GL2328285@nvidia.com> (raw)
In-Reply-To: <Yd0IeK5s/E0fuWqn@casper.infradead.org>
On Tue, Jan 11, 2022 at 04:32:56AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> >
> > > Finally, it may be possible to stop using scatterlist to describe the
> > > input to the DMA-mapping operation. We may be able to get struct
> > > scatterlist down to just dma_address and dma_length, with chaining
> > > handled through an enclosing struct.
> >
> > Can you talk about this some more? IMHO one of the key properties of
> > the scatterlist is that it can hold huge amounts of pages without
> > having to do any kind of special allocation due to the chaining.
> >
> > The same will be true of the phyr idea right?
>
> My thinking is that we'd pass a relatively small array of phyr (maybe 16
> entries) to get_user_phyr(). If that turned out not to be big enough,
> then we have two options; one is to map those 16 ranges with sg and use
> the sg chaining functionality before throwing away the phyr and calling
> get_user_phyr() again.
Then we are we using get_user_phyr() at all if we are just storing it
in a sg?
Also 16 entries is way to small, it should be at least a whole PMD
worth so we don't have to relock the PMD level each iteration.
I would like to see a flow more like:
cpu_phyr_list = get_user_phyr(uptr, 1G);
dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
[..]
dma_unmap_phyr(device, dma_phyr_list);
unpin_drity_free(cpu_phy_list);
Where dma_map_phyr() can build a temporary SGL for old iommu drivers
compatability. iommu drivers would want to implement natively, of
course.
ie no loops in drivers.
> The question is whether this is the right kind of optimisation to be
> doing. I hear you that we want a dense format, but it's questionable
> whether the kind of thing you're suggesting is actually denser than this
> scheme. For example, if we have 1GB pages and userspace happens to have
> allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
> as a single phyr. A power-of-two scheme would have us use four entries
> (3, 4-7, 8-9, 10).
That is not quite what I had in mind..
struct phyr_list {
unsigned int first_page_offset_bytes;
size_t total_length_bytes;
phys_addr_t min_alignment;
struct packed_phyr *list_of_pages;
};
Where each 'packed_phyr' is an aligned page of some kind. The packing
has to be able to represent any number of pfns, so we have four major
cases:
- 4k pfns (use 8 bytes)
- Natural order pfn (use 8 bytes)
- 4k aligned pfns, arbitary number (use 12 bytes)
- <4k aligned, arbitary length (use 16 bytes?)
In all cases the interior pages are fully used, only the first and
last page is sliced based on the two parameters in the phyr_list.
The first_page_offset_bytes/total_length_bytes mean we don't need to
use the inefficient coding for many common cases, just stick to the 4k
coding and slice the first/last page down.
The last case is, perhaps, a possible route to completely replace
scatterlist. Few places need true byte granularity for interior pages,
so we can invent some coding to say 'this is 8 byte aligned, and n
bytes long' that only fits < 4k or something. Exceptional cases can
then still work. I'm not sure what block needs here - is it just 512?
Basically think of list_of_pages as showing a contiguous list of at
least min_aligned pages and first_page_offset_bytes/total_length_bytes
taking a byte granular slice out of that logical range.
From a HW perspective I see two basic modalities:
- Streaming HW, which read/writes in a single pass (think
NVMe/storage/network). Usually takes a list of dma_addr_t and
length that HW just walks over. Rarely cares about things like page
boundaries. Optimization goal is to minimize list length. In this
case we map each packed_phyr into a HW SGL
- Random Access HW, which is randomly touching memory (think RDMA,
VFIO, DRM, IOMMU). Usually stores either a linear list of same-size
dma_addr_t pages, or a radix tree page table of dma_addr_t.
Needs to have a minimum alignment of each chunk (usually 4k) to
represent it. Optimization goal is to have maximum page size. In
this case we use min_alignment to size the HW array and decode the
packed_phyrs into individual pages.
> Using a (dma_addr, size_t) tuple makes coalescing adjacent pages very
> cheap.
With the above this still works, the very last entry in list_of_pages
would be the 12 byte pfn type and when we start a new page the logic
would then optimize it down to 8 bytes, if possible. At that point we
know we are not going to change it:
- An interior page that is up perfectly aligned is represented as a
natural order
- A starting page that ends on perfect alignment is widened to
natural order and first_page_offset_bytes is corrected
- An ending page that starts on perfect alignment is widened to
natural order and total_length_bytes is set
(though no harm in keeping the 12 byte represetation I suppose)
The main detail is to make the extra 4 bytes needed to store the
arbtiary pfn counts optional so when we don't need it, it isn't there.
> > VFIO would like this structure as well as it currently is a very
> > inefficient page at a time loop when it iommu maps things.
>
> I agree that you need these things. I think I'll run into trouble
> if I try to do them for you ... so I'm going to stop after doing the
> top end (pinning pages and getting them into an sg list) and let
> people who know that area better than I do work on that.
I agree, that is why I was asking for the datastructure 'phyr_list',
with chaining and so on.
I would imagine a few steps to this process:
1) 'phyr_list' datastructure, with chaining, pre-allocation, etc
2) Wrapper around existing gup to get a phyr_list for user VA
3) Compat 'dma_map_phyr()' that coverts a phyr_list to a sgl and back
(However, with full performance for iommu passthrough)
4) Patches changing RDMA/VFIO/DRM to this API
5) Patches optimizing get_user_phyr()
6) Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver
Obviously not all done by you.. I'm happy to help and take a swing at
the RDMA and VFIO parts.
I feel like we can go ahead with RDMA so long as the passthrough IOMMU
case is 100% efficient. VFIO is already maximually inefficient here,
so no worry there already. If VFIO and RDMA consume these APIs then
the server IOMMU driver owners should have a strong motivation to
implement.
My feeling is that at the core this project is about making a better
datastructure than scattterlist that can mostly replace it, then going
around the kernel and converting scatterlist users.
Given all the usage considerations I think it is an interesting
datastructure.
Jason
next prev parent reply other threads:[~2022-01-11 15:02 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-01-10 19:34 Phyr Starter Matthew Wilcox
2022-01-10 19:34 ` Matthew Wilcox
2022-01-11 0:41 ` Jason Gunthorpe
2022-01-11 0:41 ` Jason Gunthorpe
2022-01-11 4:32 ` Matthew Wilcox
2022-01-11 4:32 ` Matthew Wilcox
2022-01-11 15:01 ` Jason Gunthorpe [this message]
2022-01-11 15:01 ` Jason Gunthorpe
2022-01-11 18:33 ` Matthew Wilcox
2022-01-11 18:33 ` Matthew Wilcox
2022-01-11 20:21 ` Jason Gunthorpe
2022-01-11 20:21 ` Jason Gunthorpe
2022-01-11 21:25 ` Matthew Wilcox
2022-01-11 21:25 ` Matthew Wilcox
2022-01-11 22:09 ` Logan Gunthorpe
2022-01-11 22:09 ` Logan Gunthorpe
2022-01-11 22:57 ` Jason Gunthorpe
2022-01-11 22:57 ` Jason Gunthorpe
2022-01-11 23:02 ` Logan Gunthorpe
2022-01-11 23:02 ` Logan Gunthorpe
2022-01-11 22:53 ` Jason Gunthorpe
2022-01-11 22:53 ` Jason Gunthorpe
2022-01-11 22:57 ` Logan Gunthorpe
2022-01-11 22:57 ` Logan Gunthorpe
2022-01-11 23:02 ` Jason Gunthorpe
2022-01-11 23:02 ` Jason Gunthorpe
2022-01-11 23:08 ` Logan Gunthorpe
2022-01-11 23:08 ` Logan Gunthorpe
2022-01-12 18:37 ` Matthew Wilcox
2022-01-12 18:37 ` Matthew Wilcox
2022-01-12 19:08 ` Jason Gunthorpe
2022-01-12 19:08 ` Jason Gunthorpe
2022-01-20 14:03 ` Christoph Hellwig
2022-01-20 17:17 ` Jason Gunthorpe
2022-01-20 17:17 ` Jason Gunthorpe
2022-01-20 14:00 ` Christoph Hellwig
2022-01-11 9:05 ` Daniel Vetter
2022-01-11 9:05 ` Daniel Vetter
2022-01-11 20:26 ` Jason Gunthorpe
2022-01-11 20:26 ` Jason Gunthorpe
2022-01-20 14:09 ` Christoph Hellwig
2022-01-20 13:56 ` Christoph Hellwig
2022-01-20 15:27 ` Keith Busch
2022-01-20 15:27 ` Keith Busch
2022-01-20 15:28 ` Christoph Hellwig
2022-01-20 17:54 ` Robin Murphy
2022-01-11 8:17 ` John Hubbard
2022-01-11 8:17 ` John Hubbard
2022-01-11 14:01 ` Matthew Wilcox
2022-01-11 14:01 ` Matthew Wilcox
2022-01-11 15:02 ` Jason Gunthorpe
2022-01-11 15:02 ` Jason Gunthorpe
2022-01-11 17:31 ` Logan Gunthorpe
2022-01-11 17:31 ` Logan Gunthorpe
2022-01-20 14:12 ` Christoph Hellwig
2022-01-20 21:35 ` John Hubbard
2022-01-20 21:35 ` John Hubbard
2022-01-11 11:40 ` Thomas Zimmermann
2022-01-11 13:56 ` Matthew Wilcox
2022-01-11 13:56 ` Matthew Wilcox
2022-01-11 14:10 ` Thomas Zimmermann
2022-01-20 13:39 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220111150142.GL2328285@nvidia.com \
--to=jgg@nvidia.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=hch@lst.de \
--cc=jhubbard@nvidia.com \
--cc=joao.m.martins@oracle.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-rdma@vger.kernel.org \
--cc=logang@deltatee.com \
--cc=ming.lei@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=nvdimm@lists.linux.dev \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.