* AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN

From: David Howells @ 2025-05-12 14:51 UTC
To: Andrew Lunn
Cc: dhowells, Eric Dumazet, David S. Miller, Jakub Kicinski, David Hildenbrand,
    John Hubbard, Christoph Hellwig, willy, Christian Brauner, Al Viro,
    Miklos Szeredi, torvalds, netdev, linux-mm, linux-fsdevel, linux-kernel

I'm looking at how to make sendmsg() handle page pinning - and also working
towards supporting the page refcount eventually being removed and only being
available with certain memory types.

One of the outstanding issues is in sendmsg(). Analogously with DIO writes,
sendmsg() should be pinning memory (FOLL_PIN/GUP) rather than simply getting
refs on it before it attaches it to an sk_buff. Without this, if memory is
spliced into an AF_UNIX socket and then the process forks, that memory gets
attached to the child process, and the child can alter the data, probably by
accident, if the memory is on the stack or in the heap.

Further, kernel services can use MSG_SPLICE_PAGES to attach memory directly to
an AF_UNIX pipe (though I'm not sure if anyone actually does this).

(For writing to TCP/UDP with MSG_ZEROCOPY, MSG_SPLICE_PAGES or vmsplice, I
think we're probably fine - assuming the loopback driver doesn't give the
receiver the transmitter's buffers to use directly... This may be a big
'if'.)

Now, this probably wouldn't be a problem, but for the fact that one can also
splice this stuff back *out* of the socket.

The same issues exist for pipes too.

The question is what should happen here to a memory span for which the network
layer or pipe driver is not allowed to take reference, but rather must call a
destructor? Particularly if, say, it's just a small part of a larger span.

It seems reasonable that we should allow pinned memory spans to be queued in a
socket or a pipe - that way, we only have to copy the data once in the event
that the data is extracted with read(), recvmsg() or similar. But if it's
spliced out we then have all the fun of managing the lifetime - especially if
it's a big transfer that gets split into bits. In such a case, I wonder if we
can just duplicate the memory at splice-out rather than trying to keep all the
tracking intact.

If the memory was copied in, then moving the pages should be fine - though the
memory may not be of a ref'able type (which would be fun if bits of such a
page get spliced to different places).

I'm sure there is some app somewhere (fuse maybe?) where this would be a
performance problem, though.

And then there's vmsplice(). The same goes for vmsplice() to AF_UNIX or to a
pipe. That should also pin memory. It may also be possible to vmsplice a
pinned page into the target process's VM or a page from a memory span with
some other type of destruction. I don't suppose we can deprecate vmsplice()?

David
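As a rough illustration of what "pinning rather than taking refs" would mean on
the sendmsg() path: the pages behind the user buffer get a FOLL_PIN via the
pin_user_pages*() API and are released with unpin_user_pages() from whatever
destructor finally frees the queued data. This is only a sketch under those
assumptions - the helper names are invented, error unwinding is elided, and
whether FOLL_LONGTERM is also warranted for data parked indefinitely in a
socket queue is part of the open question.

	/*
	 * Hypothetical helper: pin the pages behind a user buffer before
	 * attaching them to an sk_buff (sketch only, not existing kernel
	 * code).  FOLL_PIN is implied by the pin_user_pages*() API; no
	 * FOLL_WRITE because transmit only reads from the pages.
	 */
	static int sendmsg_pin_user_buffer(unsigned long uaddr, size_t len,
					   struct page **pages,
					   unsigned int max_pages)
	{
		unsigned int nr = DIV_ROUND_UP(offset_in_page(uaddr) + len,
					       PAGE_SIZE);
		int pinned;

		if (nr > max_pages)
			return -EINVAL;

		pinned = pin_user_pages_fast(uaddr, nr, 0, pages);
		if (pinned < 0)
			return pinned;
		if (pinned != nr) {
			unpin_user_pages(pages, pinned);
			return -EFAULT;
		}
		return nr;
	}

	/* Run from the destructor once the skb data is truly finished with. */
	static void sendmsg_unpin_user_buffer(struct page **pages,
					      unsigned int nr)
	{
		unpin_user_pages(pages, nr);
	}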
* Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN 2025-05-12 14:51 ` AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN David Howells @ 2025-05-12 21:59 ` David Hildenbrand 2025-06-23 11:50 ` Christian Brauner ` (2 subsequent siblings) 3 siblings, 0 replies; 11+ messages in thread From: David Hildenbrand @ 2025-05-12 21:59 UTC (permalink / raw) To: David Howells, Andrew Lunn Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, John Hubbard, Christoph Hellwig, willy, Christian Brauner, Al Viro, Miklos Szeredi, torvalds, netdev, linux-mm, linux-fsdevel, linux-kernel On 12.05.25 16:51, David Howells wrote: > I'm looking at how to make sendmsg() handle page pinning - and also working > towards supporting the page refcount eventually being removed and only being > available with certain memory types. > > One of the outstanding issues is in sendmsg(). Analogously with DIO writes, > sendmsg() should be pinning memory (FOLL_PIN/GUP) rather than simply getting > refs on it before it attaches it to an sk_buff. Without this, if memory is > spliced into an AF_UNIX socket and then the process forks, that memory gets > attached to the child process, and the child can alter the data That should not be possible. Neither the child nor the parent can modify the page. Any write attempt will result in Copy-on-Write. The issue is that if the parent writes to some unrelated part of the page after fork() but before DIO completed, the parent will trigger Copy-on-Write and the DIO will essentially be lost from the parent's POV (goes to the wrong page). > probably by > accident, if the memory is on the stack or in the heap. > > Further, kernel services can use MSG_SPLICE_PAGES to attach memory directly to > an AF_UNIX pipe (though I'm not sure if anyone actually does this). > > (For writing to TCP/UDP with MSG_ZEROCOPY, MSG_SPLICE_PAGES or vmsplice, I > think we're probably fine - assuming the loopback driver doesn't give the > receiver the transmitter's buffers to use directly... This may be a big > 'if'.) > > Now, this probably wouldn't be a problem, but for the fact that one can also > splice this stuff back *out* of the socket. > > The same issues exist for pipes too. > > The question is what should happen here to a memory span for which the network > layer or pipe driver is not allowed to take reference, but rather must call a > destructor? Particularly if, say, it's just a small part of a larger span. > > It seems reasonable that we should allow pinned memory spans to be queued in a > socket or a pipe - that way, we only have to copy the data once in the event > that the data is extracted with read(), recvmsg() or similar. But if it's > spliced out we then have all the fun of managing the lifetime - especially if > it's a big transfer that gets split into bits. In such a case, I wonder if we > can just duplicate the memory at splice-out rather than trying to keep all the > tracking intact. > > If the memory was copied in, then moving the pages should be fine - though the > memory may not be of a ref'able type (which would be fun if bits of such a > page get spliced to different places). > > I'm sure there is some app somewhere (fuse maybe?) where this would be a > performance problem, though. > > And then there's vmsplice(). The same goes for vmsplice() to AF_UNIX or to a > pipe. That should also pin memory. It may also be possible to vmsplice a > pinned page into the target process's VM or a page from a memory span with > some other type of destruction. 
IIRC, vmsplice() never does that optimization for that direction (map pinned
page into the target process). It would be a mess.

But yes, vmsplice() should be using FOLL_PIN|FOLL_LONGTERM.

Deprecation is unlikely to happen, I'm afraid :(

--
Cheers,

David / dhildenb
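For anyone who hasn't seen the failure mode in the flesh, the fork()-vs-DIO
window described earlier in this reply can be sketched from userspace with
O_DIRECT. This is purely illustrative and not from the original mails: the
interleaving is timing-dependent, error checking is elided, and the file name
is made up.

	/* Illustrative only: the classic fork()-vs-DIO window. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdlib.h>
	#include <sys/wait.h>
	#include <unistd.h>

	static unsigned char *buf;
	static int fd;

	static void *dio_reader(void *arg)
	{
		/* GUP happens in here: with plain page refs (no FOLL_PIN),
		 * the DMA target can be "lost" to a COW copy made while the
		 * read is still in flight. */
		pread(fd, buf, 4096, 0);
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		fd = open("some_test_file", O_RDONLY | O_DIRECT);
		posix_memalign((void **)&buf, 4096, 4096);
		buf[4095] = 0;			/* fault the page in */

		pthread_create(&t, NULL, dio_reader, NULL);

		if (fork() == 0) {		/* page becomes COW-shared */
			sleep(1);
			_exit(0);
		}

		buf[4095] = 1;	/* parent's store triggers COW; an in-flight
				 * DIO then completes into the page that is
				 * now only mapped by the child */

		pthread_join(t, NULL);
		wait(NULL);
		close(fd);
		return 0;
	}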
* Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN

From: Christian Brauner @ 2025-06-23 11:50 UTC
To: David Howells
Cc: Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski,
    David Hildenbrand, John Hubbard, Christoph Hellwig, willy, Al Viro,
    Miklos Szeredi, torvalds, netdev, linux-mm, linux-fsdevel, linux-kernel

On Mon, May 12, 2025 at 03:51:30PM +0100, David Howells wrote:
> I'm looking at how to make sendmsg() handle page pinning - and also working
> towards supporting the page refcount eventually being removed and only being
> available with certain memory types.
>
> One of the outstanding issues is in sendmsg(). Analogously with DIO writes,
> sendmsg() should be pinning memory (FOLL_PIN/GUP) rather than simply getting
> refs on it before it attaches it to an sk_buff. Without this, if memory is
> spliced into an AF_UNIX socket and then the process forks, that memory gets
> attached to the child process, and the child can alter the data, probably by
> accident, if the memory is on the stack or in the heap.
>
> Further, kernel services can use MSG_SPLICE_PAGES to attach memory directly to
> an AF_UNIX pipe (though I'm not sure if anyone actually does this).

I would possibly be interested in using this for the coredump af_unix socket.
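For reference, kernel-side use of MSG_SPLICE_PAGES (e.g. for something like the
coredump socket idea above) is roughly the shape below - a sketch modelled on
existing in-kernel callers such as splice_to_socket(); the function name and
the caller's page-lifetime handling are assumptions, and error handling is
elided.

	/* Sketch: hand a kernel-owned page straight to a socket so that it
	 * is attached to the skb rather than copied. */
	static int example_splice_page_to_sock(struct socket *sock,
					       struct page *page,
					       unsigned int offset, size_t len)
	{
		struct bio_vec bvec;
		struct msghdr msg = {
			.msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT,
		};

		bvec_set_page(&bvec, page, len, offset);
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);

		/* The protocol takes its own reference on (or, in a pinned
		 * world, would pin) the page; the caller must keep the data
		 * stable until the skb is freed. */
		return sock_sendmsg(sock, &msg);
	}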
* Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN 2025-05-12 14:51 ` AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN David Howells 2025-05-12 21:59 ` David Hildenbrand 2025-06-23 11:50 ` Christian Brauner @ 2025-06-23 13:53 ` Christoph Hellwig 2025-06-23 14:16 ` David Howells 3 siblings, 0 replies; 11+ messages in thread From: Christoph Hellwig @ 2025-06-23 13:53 UTC (permalink / raw) To: David Howells Cc: Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski, David Hildenbrand, John Hubbard, Christoph Hellwig, willy, Christian Brauner, Al Viro, Miklos Szeredi, torvalds, netdev, linux-mm, linux-fsdevel, linux-kernel On Mon, May 12, 2025 at 03:51:30PM +0100, David Howells wrote: > I'm looking at how to make sendmsg() handle page pinning - and also working > towards supporting the page refcount eventually being removed and only being > available with certain memory types. Yes, that would be great. > The question is what should happen here to a memory span for which the network > layer or pipe driver is not allowed to take reference, but rather must call a > destructor? Particularly if, say, it's just a small part of a larger span. What is a "span" in this context? In general splice unlike direct I/O relies on page reference counts inside the splice machinery. But that is configurable through the pipe_buf_operations. So if you want something to be handled by splice that does not use simple page refcounts you need special pipe_buf_operations for it. And you'd better have a really good use case for this to be worthwhile. > And then there's vmsplice(). The same goes for vmsplice() to AF_UNIX or to a > pipe. That should also pin memory. It may also be possible to vmsplice a > pinned page into the target process's VM or a page from a memory span with > some other type of destruction. I don't suppose we can deprecate vmsplice()? You'll need a longterm pin for vmsplice. I'd love to deprecate it, but I doubt it's going to go away any time soon if ever. ^ permalink raw reply [flat|nested] 11+ messages in thread
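To make that concrete, the hook Christoph is pointing at would look something
like the following - a pipe_buf_operations whose ->release() drops a pin (or
runs a destructor) instead of the usual put_page(). This is a sketch from
memory of the ops layout in include/linux/pipe_fs_i.h; the stealing and
duplication side is deliberately left out because that is exactly where the
lifetime questions in this thread live.

	static void pinned_pipe_buf_release(struct pipe_inode_info *pipe,
					    struct pipe_buffer *buf)
	{
		/* Instead of the usual put_page()... */
		unpin_user_page(buf->page);
	}

	static const struct pipe_buf_operations pinned_pipe_buf_ops = {
		.release	= pinned_pipe_buf_release,
		/* .get / .try_steal deliberately omitted: duplicating or
		 * stealing a pinned page is where the trouble starts. */
	};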
* Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN 2025-05-12 14:51 ` AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN David Howells ` (2 preceding siblings ...) 2025-06-23 13:53 ` Christoph Hellwig @ 2025-06-23 14:16 ` David Howells 3 siblings, 0 replies; 11+ messages in thread From: David Howells @ 2025-06-23 14:16 UTC (permalink / raw) To: Christoph Hellwig Cc: dhowells, Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski, David Hildenbrand, John Hubbard, willy, Christian Brauner, Al Viro, Miklos Szeredi, torvalds, netdev, linux-mm, linux-fsdevel, linux-kernel Christoph Hellwig <hch@infradead.org> wrote: > > The question is what should happen here to a memory span for which the > > network layer or pipe driver is not allowed to take reference, but rather > > must call a destructor? Particularly if, say, it's just a small part of a > > larger span. > > What is a "span" in this context? In the first case, I was thinking along the lines of a bio_vec that says {physaddr,len} defining a "span" of memory. Basically just a contiguous range of physical addresses, if you prefer. However, someone can, for example, vmsplice a span of memory into a pipe - say they add a whole page, all nicely aligned, but then they splice it out a byte at a time into 4096 other pipes. Each of those other pipes now has a small part of a larger span and needs to share the cleanup information. Now, imagine that a network filesystem writes a message into a TCP socket, where that message corresponds to an RPC call request and includes a number of kernel buffers that the network layer isn't permitted to look at the refcounts on, but rather a destructor must be called. The request message may transit through the loopback driver and get placed on the Rx queue of another TCP socket - from whence it may be spliced off into a pipe. Alternatively, if virtual I/O is involved, this message may get passed down to a layer outside of the system (though I don't think this is, in principle, any different from DMA being done by a NIC). And then there's relayfs and fuse, which seem to do weird stuff. For the splicing of a loop-backed kernel message out of a TCP socket, it might make sense just to copy the message at that point. The problem is that the kernel doesn't know what's going to happen next to it. > In general splice unlike direct I/O relies on page reference counts inside > the splice machinery. But that is configurable through the > pipe_buf_operations. So if you want something to be handled by splice that > does not use simple page refcounts you need special pipe_buf_operations for > it. And you'd better have a really good use case for this to be worthwhile. Yes. vmsplice, is the equivalent of direct I/O and should really do the same pinning thing that, say, write() to an O_DIRECT file does. David ^ permalink raw reply [flat|nested] 11+ messages in thread
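One way to picture the "small part of a larger span" problem is a shared,
refcounted tracking object that every sub-slice points at, with the real
destructor run only when the last slice anywhere (socket queue, pipe, another
pipe...) is released. This is entirely hypothetical - none of these names
exist - but it shows where the cost lands: every splice-out has to propagate
and eventually drop a reference.

	/* Hypothetical cleanup-sharing object for a span with a destructor. */
	struct span_track {
		refcount_t	refs;
		phys_addr_t	base;
		size_t		len;
		void		(*destroy)(struct span_track *st);
	};

	static inline void span_track_get(struct span_track *st)
	{
		refcount_inc(&st->refs);
	}

	static inline void span_track_put(struct span_track *st)
	{
		if (refcount_dec_and_test(&st->refs))
			st->destroy(st);
	}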
* How to handle P2P DMA with only {physaddr,len} in bio_vec?

From: David Howells @ 2025-06-23 10:50 UTC
To: Christoph Hellwig
Cc: dhowells, Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski,
    David Hildenbrand, John Hubbard, Mina Almasry, willy, Christian Brauner,
    Al Viro, netdev, linux-mm, linux-fsdevel, linux-kernel

Hi Christoph,

Looking at the DMA address mapping infrastructure, it makes use of the page
struct to access the physical address (which obviously shouldn't be a problem)
and to find out if the page is involved in P2P DMA. dma_direct_map_page()
calls is_pci_p2pdma_page():

	static inline bool is_pci_p2pdma_page(const struct page *page)
	{
		return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
			is_zone_device_page(page) &&
			page_pgmap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
	}

What's the best way to manage this without having to go back to the page
struct for every DMA mapping we want to make? Do we need to have
iov_extract_user_pages() note this in the bio_vec?

	struct bio_vec {
		physaddr_t	bv_base_addr;	/* 64-bits */
		size_t		bv_len:56;	/* Maybe just u32 */
		bool		p2pdma:1;	/* Region is involved in P2P */
		unsigned int	spare:7;
	};

I'm guessing that only folio-type pages can be involved in this:

	static inline struct dev_pagemap *page_pgmap(const struct page *page)
	{
		VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page);
		return page_folio(page)->pgmap;
	}

as only struct folio has a pointer to dev_pagemap? And I assume this is going
to get removed from struct page itself at some point soonish.

David
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec? 2025-06-23 10:50 ` How to handle P2P DMA with only {physaddr,len} in bio_vec? David Howells @ 2025-06-23 13:46 ` Christoph Hellwig 2025-06-23 23:38 ` Alistair Popple 2025-06-24 9:02 ` David Howells 1 sibling, 1 reply; 11+ messages in thread From: Christoph Hellwig @ 2025-06-23 13:46 UTC (permalink / raw) To: David Howells Cc: Christoph Hellwig, Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry, willy, Christian Brauner, Al Viro, netdev, linux-mm, linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe, Jason Gunthorpe Hi David, On Mon, Jun 23, 2025 at 11:50:58AM +0100, David Howells wrote: > What's the best way to manage this without having to go back to the page > struct for every DMA mapping we want to make? There isn't a very easy way. Also because if you actually need to do peer to peer transfers, you right now absolutely need the page to find the pgmap that has the information on how to perform the peer to peer transfer. > Do we need to have > iov_extract_user_pages() note this in the bio_vec? > > struct bio_vec { > physaddr_t bv_base_addr; /* 64-bits */ > size_t bv_len:56; /* Maybe just u32 */ > bool p2pdma:1; /* Region is involved in P2P */ > unsigned int spare:7; > }; Having a flag in the bio_vec might be a way to shortcut the P2P or not decision a bit. The downside is that without the flag, the bio_vec in the brave new page-less world would actually just be: struct bio_vec { phys_addr_t bv_phys; u32 bv_len; } __packed; i.e. adding any more information would actually increase the size from 12 bytes to 16 bytes for the usualy 64-bit phys_addr_t setups, and thus undo all the memory savings that this move would provide. Note that at least for the block layer the DMA mapping changes I'm about to send out again require each bio to be either non P2P or P2P to a specific device. It might be worth to also extend this higher level limitation to other users if feasible. > I'm guessing that only folio-type pages can be involved in this: > > static inline struct dev_pagemap *page_pgmap(const struct page *page) > { > VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page); > return page_folio(page)->pgmap; > } > > as only struct folio has a pointer to dev_pagemap? And I assume this is going > to get removed from struct page itself at some point soonish. I guess so. ^ permalink raw reply [flat|nested] 11+ messages in thread
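The arithmetic behind that is easy to check in isolation - assuming a 64-bit
phys_addr_t and a typical LP64 ABI, the packed two-field layout is 12 bytes,
and adding even one more 32-bit word brings it back up to 16. A stand-alone
illustration, not kernel code:

	#include <stdint.h>

	typedef uint64_t phys_addr_t;	/* 64-bit phys_addr_t assumed */

	struct bio_vec_packed {
		phys_addr_t	bv_phys;
		uint32_t	bv_len;
	} __attribute__((packed));

	struct bio_vec_flagged {	/* one extra word for flags... */
		phys_addr_t	bv_phys;
		uint32_t	bv_len;
		uint32_t	bv_flags;
	};

	_Static_assert(sizeof(struct bio_vec_packed) == 12,
		       "the 12-byte case from the mail");
	_Static_assert(sizeof(struct bio_vec_flagged) == 16,
		       "...and the saving is gone");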
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec? 2025-06-23 13:46 ` Christoph Hellwig @ 2025-06-23 23:38 ` Alistair Popple 0 siblings, 0 replies; 11+ messages in thread From: Alistair Popple @ 2025-06-23 23:38 UTC (permalink / raw) To: Christoph Hellwig Cc: David Howells, Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry, willy, Christian Brauner, Al Viro, netdev, linux-mm, linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe, Jason Gunthorpe On Mon, Jun 23, 2025 at 06:46:47AM -0700, Christoph Hellwig wrote: > Hi David, > > On Mon, Jun 23, 2025 at 11:50:58AM +0100, David Howells wrote: > > What's the best way to manage this without having to go back to the page > > struct for every DMA mapping we want to make? > > There isn't a very easy way. Also because if you actually need to do > peer to peer transfers, you right now absolutely need the page to find > the pgmap that has the information on how to perform the peer to peer > transfer. > > > Do we need to have > > iov_extract_user_pages() note this in the bio_vec? > > > > struct bio_vec { > > physaddr_t bv_base_addr; /* 64-bits */ > > size_t bv_len:56; /* Maybe just u32 */ > > bool p2pdma:1; /* Region is involved in P2P */ > > unsigned int spare:7; > > }; > > Having a flag in the bio_vec might be a way to shortcut the P2P or not > decision a bit. The downside is that without the flag, the bio_vec > in the brave new page-less world would actually just be: > > struct bio_vec { > phys_addr_t bv_phys; > u32 bv_len; > } __packed; > > i.e. adding any more information would actually increase the size from > 12 bytes to 16 bytes for the usualy 64-bit phys_addr_t setups, and thus > undo all the memory savings that this move would provide. > > Note that at least for the block layer the DMA mapping changes I'm about > to send out again require each bio to be either non P2P or P2P to a > specific device. It might be worth to also extend this higher level > limitation to other users if feasible. > > > I'm guessing that only folio-type pages can be involved in this: > > > > static inline struct dev_pagemap *page_pgmap(const struct page *page) > > { > > VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page); > > return page_folio(page)->pgmap; > > } > > > > as only struct folio has a pointer to dev_pagemap? And I assume this is going > > to get removed from struct page itself at some point soonish. > > I guess so. It already has been as the struct page field was renamed due to higher order folios needing the struct page dev_pgmap for compound_head. Obviously for order-0 folios the folio/page pgmap fields are in practice the same but I suppose that will change once struct page is shrunk. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec? 2025-06-23 10:50 ` How to handle P2P DMA with only {physaddr,len} in bio_vec? David Howells 2025-06-23 13:46 ` Christoph Hellwig @ 2025-06-24 9:02 ` David Howells 2025-06-24 12:18 ` Jason Gunthorpe 2025-06-24 12:39 ` Christoph Hellwig 1 sibling, 2 replies; 11+ messages in thread From: David Howells @ 2025-06-24 9:02 UTC (permalink / raw) To: Christoph Hellwig Cc: dhowells, Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry, willy, Christian Brauner, Al Viro, netdev, linux-mm, linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe, Jason Gunthorpe Christoph Hellwig <hch@infradead.org> wrote: > On Mon, Jun 23, 2025 at 11:50:58AM +0100, David Howells wrote: > > What's the best way to manage this without having to go back to the page > > struct for every DMA mapping we want to make? > > There isn't a very easy way. Also because if you actually need to do > peer to peer transfers, you right now absolutely need the page to find > the pgmap that has the information on how to perform the peer to peer > transfer. Are you expecting P2P to become particularly common? Because page struct lookups will become more expensive because we'll have to do type checking and Willy may eventually move them from a fixed array into a maple tree - so if we can record the P2P flag in the bio_vec, it would help speed up the "not P2P" case. > > Do we need to have > > iov_extract_user_pages() note this in the bio_vec? > > > > struct bio_vec { > > physaddr_t bv_base_addr; /* 64-bits */ > > size_t bv_len:56; /* Maybe just u32 */ > > bool p2pdma:1; /* Region is involved in P2P */ > > unsigned int spare:7; > > }; > > Having a flag in the bio_vec might be a way to shortcut the P2P or not > decision a bit. The downside is that without the flag, the bio_vec > in the brave new page-less world would actually just be: > > struct bio_vec { > phys_addr_t bv_phys; > u32 bv_len; > } __packed; > > i.e. adding any more information would actually increase the size from > 12 bytes to 16 bytes for the usualy 64-bit phys_addr_t setups, and thus > undo all the memory savings that this move would provide. Do we actually need 32 bits for bv_len, especially given that MAX_RW_COUNT is capped at a bit less than 2GiB? Could we, say, do: struct bio_vec { phys_addr_t bv_phys; u32 bv_len:31; u32 bv_use_p2p:1; } __packed; And rather than storing the how-to-do-P2P info in the page struct, does it make sense to hold it separately, keyed on bv_phys? Also, is it possible for the networking stack, say, to trivially map the P2P memory in order to checksum it? I presume bv_phys in that case would point to a mapping of device memory? Thanks, David ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec? 2025-06-24 9:02 ` David Howells @ 2025-06-24 12:18 ` Jason Gunthorpe 2025-06-24 12:39 ` Christoph Hellwig 1 sibling, 0 replies; 11+ messages in thread From: Jason Gunthorpe @ 2025-06-24 12:18 UTC (permalink / raw) To: David Howells Cc: Christoph Hellwig, Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry, willy, Christian Brauner, Al Viro, netdev, linux-mm, linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe On Tue, Jun 24, 2025 at 10:02:05AM +0100, David Howells wrote: > Christoph Hellwig <hch@infradead.org> wrote: > > > On Mon, Jun 23, 2025 at 11:50:58AM +0100, David Howells wrote: > > > What's the best way to manage this without having to go back to the page > > > struct for every DMA mapping we want to make? > > > > There isn't a very easy way. Also because if you actually need to do > > peer to peer transfers, you right now absolutely need the page to find > > the pgmap that has the information on how to perform the peer to peer > > transfer. > > Are you expecting P2P to become particularly common? It is becoming common place in certain kinds of server system types. If half the system's memory is behind PCI on a GPU or something then you need P2P. > Do we actually need 32 bits for bv_len, especially given that MAX_RW_COUNT is > capped at a bit less than 2GiB? Could we, say, do: > > struct bio_vec { > phys_addr_t bv_phys; > u32 bv_len:31; > u32 bv_use_p2p:1; > } __packed; > > And rather than storing the how-to-do-P2P info in the page struct, does it > make sense to hold it separately, keyed on bv_phys? I though we had agreed these sorts of 'mixed transfers' were not desirable and we want things to be uniform at this lowest level. So, I suggest the bio_vec should be entirely uniform, either it is all CPU memory or it is all P2P from the same source. This is what the block stack is doing by holding the P2P flag in the bio and splitting the bios when they are constructed. My intention to make a more general, less performant, API was to copy what bio is doing and have a list of bio_vecs, each bio_vec having the same properties. The struct enclosing the bio_vec (the bio, etc) would have the the flag if it is p2p and some way to get the needed p2p source metadata. The bio_vec itself would just store physical addresses and lengths. No need for complicated bit slicing. I think this is important because the new DMA API really doesn't want to be changing modes on a per-item basis.. Jason ^ permalink raw reply [flat|nested] 11+ messages in thread
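Put as a data structure, the shape Jason describes is roughly the hypothetical
container below: the per-segment array stays a plain, uniform {physaddr,len}
list, and the P2P-ness plus the source metadata (e.g. the pgmap) are stated
once by whatever owns the array. All of the names here are invented for
illustration; struct bio_vec is assumed to be the slimmed-down {phys,len} form
discussed above.

	/* Hypothetical: one uniform run of segments, described once. */
	struct phys_vec_batch {
		bool			is_p2p;		/* whole batch or nothing */
		struct dev_pagemap	*p2p_pgmap;	/* P2P source info, if set */
		unsigned int		nr_segs;
		struct bio_vec		segs[];		/* each just {phys, len} */
	};

A DMA-mapping loop then tests is_p2p once per batch rather than once per
bio_vec, which matches how the block layer splits its bios.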
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec? 2025-06-24 9:02 ` David Howells 2025-06-24 12:18 ` Jason Gunthorpe @ 2025-06-24 12:39 ` Christoph Hellwig 1 sibling, 0 replies; 11+ messages in thread From: Christoph Hellwig @ 2025-06-24 12:39 UTC (permalink / raw) To: David Howells Cc: Christoph Hellwig, Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry, willy, Christian Brauner, Al Viro, netdev, linux-mm, linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe, Jason Gunthorpe On Tue, Jun 24, 2025 at 10:02:05AM +0100, David Howells wrote: > > There isn't a very easy way. Also because if you actually need to do > > peer to peer transfers, you right now absolutely need the page to find > > the pgmap that has the information on how to perform the peer to peer > > transfer. > > Are you expecting P2P to become particularly common? What do you mean with 'particularly common'? In general it's a very niche thing. But in certain niches it gets used more and more. > Because page struct > lookups will become more expensive because we'll have to do type checking and > Willy may eventually move them from a fixed array into a maple tree - so if we > can record the P2P flag in the bio_vec, it would help speed up the "not P2P" > case. As said before, the best place for that is a higher level structure than the bio_vec. > Do we actually need 32 bits for bv_len, especially given that MAX_RW_COUNT is > capped at a bit less than 2GiB? Could we, say, do: > > struct bio_vec { > phys_addr_t bv_phys; > u32 bv_len:31; > u32 bv_use_p2p:1; > } __packed; I've already heard people complain 32-bit might not be enough :) > And rather than storing the how-to-do-P2P info in the page struct, does it > make sense to hold it separately, keyed on bv_phys? Maybe. But then you need to invent your own new refcounting for the section representing the hot pluggable p2p memory. > Also, is it possible for the networking stack, say, to trivially map the P2P > memory in order to checksum it? I presume bv_phys in that case would point to > a mapping of device memory? P2P is always to MMIO regions. So you can access it using the usual MMIO helpers. ^ permalink raw reply [flat|nested] 11+ messages in thread
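On the checksumming point, "the usual MMIO helpers" in practice means bouncing
the data through ordinary memory, something like the sketch below. It is
illustrative only: real P2P pages normally already have a kernel mapping set
up when the provider registers them, so the fresh ioremap() here stands in for
whatever mapping is actually available, and error handling is elided.

	/* Sketch: CPU access to a P2P (MMIO) span, e.g. to checksum it. */
	static __wsum example_csum_p2p(phys_addr_t phys, size_t len, __wsum csum)
	{
		void __iomem *vaddr = ioremap(phys, len);
		void *bounce = kmalloc(len, GFP_KERNEL);

		if (vaddr && bounce) {
			memcpy_fromio(bounce, vaddr, len); /* MMIO-safe copy */
			csum = csum_partial(bounce, len, csum);
		}

		kfree(bounce);
		if (vaddr)
			iounmap(vaddr);
		return csum;
	}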