* How much is checksumming done in the kernel vs on the NIC?
From: David Howells @ 2025-05-02 12:07 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski; +Cc: dhowells, willy, netdev
Hi Dave, Jakub,
I'm looking into making the sendmsg() code properly handle the 'DIO vs fork'
issue (where pages need pinning rather than refs taken) and also getting rid
of the taking of refs entirely as the page refcount is going to go away in the
relatively near future.
I'm wondering quite how to approach this, and whether you have
any ideas about the following:
(1) How much do we need to do packet checksumming in the kernel these days
rather than offloading it to the NIC?
(2) How often do modern kernels encounter NICs that can only take a single
{pointer,len} extent for any particular packet rather than a list of
such?
Thanks,
David
* Re: How much is checksumming done in the kernel vs on the NIC?
From: Andrew Lunn @ 2025-05-02 13:09 UTC (permalink / raw)
To: David Howells; +Cc: David S. Miller, Jakub Kicinski, willy, netdev
On Fri, May 02, 2025 at 01:07:01PM +0100, David Howells wrote:
> Hi Dave, Jakub,
>
> I'm looking into making the sendmsg() code properly handle the 'DIO vs fork'
> issue (where pages need pinning rather than refs taken) and also getting rid
> of the taking of refs entirely as the page refcount is going to go away in the
> relatively near future.
Sorry, new to this conversation, and i don't know what you mean by DIO
vs fork. Could you point me at a discussion.
> I'm wondering quite how to approach this, and whether you have
> any ideas about the following:
>
> (1) How much do we need to do packet checksumming in the kernel these days
> rather than offloading it to the NIC?
>
> (2) How often do modern kernels encounter NICs that can only take a single
> {pointer,len} extent for any particular packet rather than a list of
> such?
You might need to narrow this down to classes of NICs.
Some 'NICs' embedded in SOCs don't have scatter/gather. Some
automotive NICs transfer data via SPI; in theory the driver could build
a linked list of SPI transfer requests, one per {pointer,len}, and hand
it over to the SPI core as a single operation, but in practice the MAC
driver tends to do this scatter/gather by hand.
There are some NICs which get confused when you add extra headers near
the beginning of the packet, so they cannot perform any checksumming
deeper than the FCS; the IP and UDP checksums then have to be done in
software, etc.
Modern kernels still support NICs from the 90s (the original Donald Becker
drivers), so you cannot assume too much of the hardware.
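To illustrate the split: whether the stack checksums or linearizes in
software is driven by the feature bits the driver advertises. Roughly (a
paraphrase of the validate_xmit_skb() logic, not the exact code; 'netdev',
'features' and the 'drop' label are placeholders):

        /* A driver that can do scatter/gather and generic checksum
         * offload advertises it at probe time:
         */
        netdev->features |= NETIF_F_SG | NETIF_F_HW_CSUM;

        /* For hardware that can't, the core falls back to software
         * before handing the skb to the driver:
         */
        if (skb->ip_summed == CHECKSUM_PARTIAL &&
            !(features & NETIF_F_HW_CSUM) &&
            skb_checksum_help(skb))             /* checksum in the kernel */
                goto drop;

        if (skb_is_nonlinear(skb) && !(features & NETIF_F_SG) &&
            skb_linearize(skb))                 /* collapse to one extent */
                goto drop;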
Andrew
* MSG_ZEROCOPY and the O_DIRECT vs fork() race
From: David Howells @ 2025-05-02 13:41 UTC (permalink / raw)
To: Andrew Lunn
Cc: dhowells, David Hildenbrand, John Hubbard, David S. Miller,
Jakub Kicinski, willy, netdev, linux-mm
Andrew Lunn <andrew@lunn.ch> wrote:
> > I'm looking into making the sendmsg() code properly handle the 'DIO vs
> > fork' issue (where pages need pinning rather than refs taken) and also
> > getting rid of the taking of refs entirely as the page refcount is going
> > to go away in the relatively near future.
>
> Sorry, new to this conversation, and i don't know what you mean by DIO
> vs fork.
As I understand it, there's a race between O_DIRECT I/O and fork whereby if
you, say, start a DIO read operation on a page and then fork, the target page
gets attached to the child and a copy made for the parent (because the refcount is
elevated by the I/O) - and so only the child sees the result. This is made
more interesting by things such as AIO, where the parent gets the completion
notification, but not the data.
Further, a DIO write is then alterable by the child if the DMA has not yet
happened.
One of the things mm/gup.c does is to work around this issue... However, I
don't think that MSG_ZEROCOPY handles this - and so zerocopy sendmsg is, I
think, subject to the same race.
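(For reference, the DIO-style fix is to take pins rather than refs; a
minimal sketch of what that looks like with the real GUP API - 'start',
'nr_pages', 'gup_flags' and 'pages' are just placeholders here:)

        /* Pin the user buffer the way O_DIRECT does (FOLL_PIN under the
         * covers) instead of taking plain page refs.  FOLL_WRITE is only
         * needed if the device will write into the buffer (a DIO read);
         * for transmission the flags can be 0.
         */
        npinned = pin_user_pages_fast(start, nr_pages, gup_flags, pages);
        if (npinned < 0)
                return npinned;

        /* ... do the I/O ... */

        /* On completion, drop the pins rather than calling put_page(): */
        unpin_user_pages(pages, npinned);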
> Could you point me at a discussion.
I don't know of one, offhand, apart from in the logs for mm/gup.c. I've added
a couple more mm guys and the mm list to the cc: field.
The information in the description of fc1d8e7cca2daa18d2fe56b94874848adf89d7f5
may be relevant.
David
* Re: MSG_ZEROCOPY and the O_DIRECT vs fork() race
From: David Hildenbrand @ 2025-05-02 13:48 UTC (permalink / raw)
To: David Howells, Andrew Lunn
Cc: John Hubbard, David S. Miller, Jakub Kicinski, willy, netdev,
linux-mm
On 02.05.25 15:41, David Howells wrote:
> Andrew Lunn <andrew@lunn.ch> wrote:
>
>>> I'm looking into making the sendmsg() code properly handle the 'DIO vs
>>> fork' issue (where pages need pinning rather than refs taken) and also
>>> getting rid of the taking of refs entirely as the page refcount is going
>>> to go away in the relatively near future.
>>
>> Sorry, new to this conversation, and i don't know what you mean by DIO
>> vs fork.
>
> As I understand it, there's a race between O_DIRECT I/O and fork whereby if
> you, say, start a DIO read operation on a page and then fork, the target page
> gets attached to the child and a copy made for the parent (because the refcount is
> elevated by the I/O) - and so only the child sees the result. This is made
> more interesting by things such as AIO, where the parent gets the completion
> notification, but not the data.
>
> Further, a DIO write is then alterable by the child if the DMA has not yet
> happened.
>
> One of the things mm/gup.c does is to work around this issue... However, I
> don't think that MSG_ZEROCOPY handles this - and so zerocopy sendmsg is, I
> think, subject to the same race.
If it's using FOLL_PIN it works. If not, it's still to be fixed.
--
Cheers,
David / dhildenb
* Re: MSG_ZEROCOPY and the O_DIRECT vs fork() race
From: Andrew Lunn @ 2025-05-02 14:21 UTC (permalink / raw)
To: David Howells
Cc: David Hildenbrand, John Hubbard, David S. Miller, Jakub Kicinski,
willy, netdev, linux-mm
On Fri, May 02, 2025 at 02:41:46PM +0100, David Howells wrote:
> Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > I'm looking into making the sendmsg() code properly handle the 'DIO vs
> > > fork' issue (where pages need pinning rather than refs taken) and also
> > > getting rid of the taking of refs entirely as the page refcount is going
> > > to go away in the relatively near future.
> >
> > Sorry, new to this conversation, and i don't know what you mean by DIO
> > vs fork.
>
> As I understand it, there's a race between O_DIRECT I/O and fork whereby if
> you, say, start a DIO read operation on a page and then fork, the target page
> gets attached to the child and a copy made for the parent (because the refcount is
> elevated by the I/O) - and so only the child sees the result. This is made
> more interesting by things such as AIO, where the parent gets the completion
> notification, but not the data.
>
> Further, a DIO write is then alterable by the child if the DMA has not yet
> happened.
>
> One of the things mm/gup.c does is to work around this issue... However, I
> don't think that MSG_ZEROCOPY handles this - and so zerocopy sendmsg is, I
> think, subject to the same race.
For zerocopy, you probably should be talking to Eric Dumazet, David Wei.
I don't know too much about this, but from the Ethernet drivers
perspective, i _think_ it has no idea about zero copy. It is just
passed a skbuf containing data, nothing special about it. Once the
interface says it is on the wire, the driver tells the netdev core it
has finished with the skbuf.
So, i guess your question about CRC is to do with CoW? If the driver
does not touch the data, just DMA it out, the page could be shared
between the processes. If it needs to modify it, put CRCs into the
packet, that write means the page cannot be shared? If you have
scatter/gather you can place the headers in kernel memory and do
writes to set the CRCs without touching the userspace data. I don't
know, but i suspect this is how it is done. There is also an skb
operation to linearize a packet (skb_linearize()), which reallocates the
skb's data area to be big enough to hold the whole packet in a single
segment and memcpys the fragments into it. Not what you want for
zerocopy, but if your interface does not have the needed support, there
is not much choice.
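Roughly how that looks from the skb side - a sketch only, with made-up
variable names; the protocol headers live in kernel memory in the skb's
linear area and the user pages are only ever referenced as fragments:

        skb = alloc_skb(hdr_len, GFP_KERNEL);
        if (!skb)
                return -ENOMEM;
        skb_put(skb, hdr_len);          /* headers go here; the stack can
                                         * write checksums into this area
                                         * without touching user pages */

        /* User data is attached as a page fragment and only ever DMA'd: */
        skb_fill_page_desc(skb, 0, user_page, offset, len);
        skb->len      += len;
        skb->data_len += len;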
Andrew
* Reorganising how the networking layer handles memory
From: David Howells @ 2025-05-02 16:21 UTC (permalink / raw)
To: Andrew Lunn
Cc: dhowells, Eric Dumazet, David S. Miller, Jakub Kicinski,
David Hildenbrand, John Hubbard, Christoph Hellwig, willy, netdev,
linux-mm
Okay, perhaps I should start at the beginning :-).
There are a number of things that are going to mandate an overhaul of how the
networking layer handles memory:
(1) The sk_buff code assumes it can take refs on pages it is given, but the
page ref counter is going to go away in the relatively near term.
Indeed, you're already not allowed to take a ref on, say, slab memory,
because the page ref doesn't control the lifetime of the object.
Even pages are going to kind of go away. Willy haz planz...
(2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because it
doesn't use page pinning. It needs to use the GUP routines.
(3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
used with certain memory types (e.g. slab). It takes a ref on whatever
it is given - which is wrong if it should pin this instead.
(4) iov_iter extraction will probably change to dispensing {physaddr,len}
tuples rather than {page,off,len} tuples. The socket layer won't then
see pages at all.
(5) Memory segments splice()'d into a socket may have who-knows-what weird
lifetime requirements.
So after discussions at LSF/MM, what I'm proposing is this:
(1) If we want to use zerocopy we (the kernel) have to pass a cleanup
function to sendmsg() along with the data. If you don't care about
zerocopy, it will copy the data.
(2) For each message sent with sendmsg, the cleanup function is called
progressively as parts of the data it included are completed. I would do
it progressively so that big messages can be handled.
(3) We also pass an optional 'refill' function to sendmsg. As data is sent,
the code that extracts the data will call this to pin more user bufs (we
don't necessarily want to pin everything up front). The refill function
is permitted to sleep to allow the amount of pinned memory to subside.
(4) We move a lot of the zerocopy wrangling code out of the basement of the
networking code and put it at the system call level, above the call to
->sendmsg() and the basement code then calls the appropriate functions to
extract, refill and clean up. It may be usable in other contexts too -
DIO to regular files, for example.
(5) The SO_EE_ORIGIN_ZEROCOPY completion notifications are then generated by
the cleanup function.
(6) The sk_buff struct does not retain *any* refs/pins on memory fragments it
refers to. This is done for it by the zerocopy layer.
This will allow us to kill three birds with one stone:
(A) It will fix the issues with zerocopy transmission mentioned above (DIO vs
fork, pin vs ref, pages without refcounts). Whoever calls sendmsg() is
then responsible for maintaining the lifetime of the memory by whatever
means necessary.
(B) Kernel drivers (e.g. network filesystems) can then use MSG_ZEROCOPY
(MSG_SPLICE_PAGES can be discarded). They can create their own message,
cobbling it together out of kmalloc'd memory and arrays of pages, safe in
the knowledge that the network stack will treat it only as an array of
memory fragments.
They would supply their own cleanup function to do the appropriate folio
putting and would not need a "refill" function. The extraction can be
handled by standard iov_iter code.
This would allow a network filesystem to transmit a complete RPC message
with a single sendmsg() call, avoiding the need to cork the socket.
(C) Make it easier to provide alternative userspace notification mechanisms
than SO_EE_ORIGIN_ZEROCOPY. Maybe by allowing a "cookie" to be passed in
the control message that can be passed back by some other mechanism
(e.g. recvmsg). Or by providing a user address that can be altered and a
futex triggered on it.
There's potentially a fourth bird too, but I'm not sure how practical it
is:
(D) What if TCP and UDP sockets, say, *only* do zerocopy? And that the
syscall layer does the buffering transparently to hide that from the
user? That could massively simplify the lower layers and perhaps make
the buffering more efficient.
For instance, the data could be organised by the top layer into (large)
pages and then the protocol would divide that up. Smaller chunks that
need to go immediately could be placed in kmalloc'd buffers rather than
using a page frag allocator.
There are some downsides/difficulties too. Firstly, it would probably
render the checksum-whilst-copying impossible (though being able to use
CPU copy acceleration might make up for that, as might checksum offload).
Secondly, it would mean that sk_buffs would have at least two fragments -
header and data - which might be impossible for some NICs.
Thirdly, some protocols just want to copy the data into their own skbuffs
whatever.
There are also some issues with this proposal:
(1) AF_ALG. This does its own thing, including direct I/O without
MSG_ZEROCOPY being set. It doesn't actually use sk_buffs. Really, it's
not a network protocol in the normal sense and might have been better
implemented as, say, a bunch of functions in io_uring.
(2) Packet crypto. Some protocols might want to do encryption from the
source buffers into the skbuff and this would amount to a duplicate copy.
This might be made more complicated by things like the TLS upper level
protocol on TCP where we might be able to offload the crypto to the NIC,
but might have to do it ourselves.
(3) Is it possible to have a mixture of zerocopy and non-zerocopy pieces in
the same sk_buff? If there's a mixture, it would be possible to deal
with the non-zerocopy bit by allocating a zerocopy record and setting
the cleanup function just to free it.
Implementation notes:
(1) What I'm thinking is that there will be an 'iov_manager' struct that
manages a single call to sendmsg() (see the hypothetical sketch after
these notes). This will be refcounted and carry the completion state
(kind of like ubuf_info) and the cleanup function pointer.
(2) The upper layer will wrap iov_manager in its own thing (kind of like
ubuf_info_msgzc).
(3) For sys_sendmsg(), sys_sendmmsg() and io_uring, use a 'syscall-level
manager' that will progressively pin and unpin userspace buffers.
(a) This will keep a list of the memory fragments it currently has pinned
in a rolling buffer. It has to be able to find them to unpin them
and it has to allow for the userspace addresses having been remapped
or unmapped.
(b) As its refill function gets called, the manager will pin more pages
and add them to the producer end of the buffer.
(c) These can then be extracted by the protocol into skbuffs.
(d) As its cleanup function gets called, it will advance the consumer end
and unpin/discard memory ranges that are consumed.
I'm not sure how much drag this might add to performance, though, so it
will need to be tried and benchmarked.
(4) Possibly, the list of fragments can be made directly available through an
iov_iter type and a subset attached directly to a sk_buff.
(5) SOCK_STREAM sockets will keep an ordered list of manager structs, each
tagged with the byte transmission sequence range for that struct. The
socket will keep a transmission completion sequence counter and as the
counter advances through the manager list, their cleanup functions will
be called and, ultimately, they'll be detached and put.
(6) SOCK_DGRAM sockets will keep a list of manager structs on the sk_buff as
well as on the socket. The problem is that they may complete out of
order, but SO_EE_ORIGIN_ZEROCOPY works by byte position. Each time a
sk_buff completes, all the managers attached to it are marked complete,
but complete managers can only get cleaned up when they reach the front
of the queue.
(7) Kernel services will wrap iov_manager in their own wrapper and will pass
down an iov_iter that describes their message in its entirety.
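For reference, a purely hypothetical shape for the iov_manager mentioned
in (1) - none of these names exist in the tree; this is just to make the
discussion concrete:

        /* Hypothetical: one per sendmsg() call. */
        struct iov_manager {
                refcount_t      ref;

                /* Called progressively as byte ranges of the message
                 * complete transmission.
                 */
                void (*cleanup)(struct iov_manager *mgr,
                                size_t from, size_t to);

                /* Optionally called to pin/extract more source memory;
                 * may sleep.
                 */
                int (*refill)(struct iov_manager *mgr, size_t needed);
        };

        /* A kernel caller would wrap it, much as ubuf_info_msgzc wraps
         * ubuf_info (again, hypothetical):
         */
        struct my_rpc_tx {
                struct iov_manager      mgr;
                struct bio_vec          *bvec;
                unsigned int            nr_bvec;
        };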
Finally, this doesn't cover recvmsg() zerocopy, which might also have some of
the same issues.
David
* Re: Reorganising how the networking layer handles memory
From: Jakub Kicinski @ 2025-05-05 20:14 UTC (permalink / raw)
To: David Howells
Cc: Andrew Lunn, Eric Dumazet, David S. Miller, David Hildenbrand,
John Hubbard, Christoph Hellwig, willy, netdev, linux-mm,
Willem de Bruijn
On Fri, 02 May 2025 17:21:48 +0100 David Howells wrote:
> Okay, perhaps I should start at the beginning :-).
Thanks :) Looks like Eric is CCed, also adding Willem.
The direction of using ubuf_info makes sense to me.
Random comments below on the little I know.
> There are a number of things that are going to mandate an overhaul of how the
> networking layer handles memory:
>
> (1) The sk_buff code assumes it can take refs on pages it is given, but the
> page ref counter is going to go away in the relatively near term.
>
> Indeed, you're already not allowed to take a ref on, say, slab memory,
> because the page ref doesn't control the lifetime of the object.
>
> Even pages are going to kind of go away. Willy haz planz...
I think the part NVMe folks run into is the BPF integration layer
called skmsg. It's supposed to be a BPF-based "data router", at
the socket layer, before any protocol processing, so it tries
to do its own page ref accounting..
> (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because it
> doesn't use page pinning. It needs to use the GUP routines.
We end up calling iov_iter_get_pages2(). Is not setting
FOLL_PIN a conscious choice, or has nobody cared until now?
> (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> used with certain memory types (e.g. slab). It takes a ref on whatever
> it is given - which is wrong if it should pin this instead.
s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
a ref to the stack, right? But yes, the networking stack will try to
release it.
> (4) iov_iter extraction will probably change to dispensing {physaddr,len}
> tuples rather than {page,off,len} tuples. The socket layer won't then
> see pages at all.
>
> (5) Memory segments splice()'d into a socket may have who-knows-what weird
> lifetime requirements.
>
> So after discussions at LSF/MM, what I'm proposing is this:
>
> (1) If we want to use zerocopy we (the kernel) have to pass a cleanup
> function to sendmsg() along with the data. If you don't care about
> zerocopy, it will copy the data.
TAL at struct ubuf_info
> (2) For each message sent with sendmsg, the cleanup function is called
> progressively as parts of the data it included are completed. I would do
> it progressively so that big messages can be handled.
>
> (3) We also pass an optional 'refill' function to sendmsg. As data is sent,
> the code that extracts the data will call this to pin more user bufs (we
> don't necessarily want to pin everything up front). The refill function
> is permitted to sleep to allow the amount of pinned memory to subside.
Why not feed the data as you get the notifications for completion?
> (4) We move a lot of the zerocopy wrangling code out of the basement of the
> networking code and put it at the system call level, above the call to
> ->sendmsg() and the basement code then calls the appropriate functions to
> extract, refill and clean up. It may be usable in other contexts too -
> DIO to regular files, for example.
>
> (5) The SO_EE_ORIGIN_ZEROCOPY completion notifications are then generated by
> the cleanup function.
Already the case? :)
> (6) The sk_buff struct does not retain *any* refs/pins on memory fragments it
> refers to. This is done for it by the zerocopy layer.
* Re: Reorganising how the networking layer handles memory
From: David Howells @ 2025-05-06 13:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: dhowells, Andrew Lunn, Eric Dumazet, David S. Miller,
David Hildenbrand, John Hubbard, Christoph Hellwig, willy, netdev,
linux-mm, Willem de Bruijn
Jakub Kicinski <kuba@kernel.org> wrote:
> > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > it doesn't use page pinning. It needs to use the GUP routines.
>
> We end up calling iov_iter_get_pages2(). Is not setting
> FOLL_PIN a conscious choice, or has nobody cared until now?
iov_iter_get_pages*() predates GUP, I think. There's now an
iov_iter_extract_pages() that does the pinning stuff, but you have to do a
different cleanup, which is why I created a new API call.
iov_iter_extract_pages() also does no pinning at all on pages extracted from a
non-user iterator (e.g. ITER_BVEC).
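For reference, the extract-and-clean-up pattern looks roughly like this
('iter', 'maxsize' and 'maxpages' are whatever the caller has to hand;
sketch only):

        struct page **pages = NULL;     /* array is allocated by the call */
        size_t offset;
        ssize_t len;

        len = iov_iter_extract_pages(iter, &pages, maxsize, maxpages,
                                     0, &offset);
        if (len < 0)
                return len;

        /* ... attach [offset, offset + len) of those pages to the skb ... */

        /* Cleanup differs by iterator type: user-backed iterators were
         * pinned, kernel-backed ones (ITER_BVEC/ITER_KVEC) were not.
         */
        if (iov_iter_extract_will_pin(iter))
                unpin_user_pages(pages, DIV_ROUND_UP(offset + len, PAGE_SIZE));
        kvfree(pages);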
>
> > (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> > used with certain memory types (e.g. slab). It takes a ref on whatever
> > it is given - which is wrong if it should pin this instead.
>
> s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
> a ref to the stack, right? But yes, the networking stack will try to
> release it.
I mean 'takes' as in skb_append_pagefrags() calls get_page() - something that
needs to be changed.
Christoph Hellwig would like to make it such that the extractor gets
{phyaddr,len} rather than {page,off,len} - so all you, the network layer, see
is that you've got a span of memory to use as your buffer. How that span of
memory is managed is the responsibility of whoever called sendmsg() - and they
need a callback to be able to handle that.
> TAL at struct ubuf_info
I've looked at it, yes, however, I'm wondering if we can make it more generic
and usable by regular file DIO and splice also.
Further, we need a way to track pages we've pinned. One way to do that is to
simply rely on the sk_buff fragment array and keep track of which particular
bits need putting/unpinning/freeing/kfreeing/etc - but really that should be
handled by the caller unless it costs too much performance (which it might).
One advantage of delegating it to the caller, though, and having the caller
keep track of which bits it still needs to hold on to by transmission
completion position is that we don't need to manage refs/pins across sk_buff
duplication - let alone what we should do with stuff that's kmalloc'd.
> > (3) We also pass an optional 'refill' function to sendmsg. As data is
> > sent, the code that extracts the data will call this to pin more user
> > bufs (we don't necessarily want to pin everything up front). The
> > refill function is permitted to sleep to allow the amount of pinned
> > memory to subside.
>
> Why not feed the data as you get the notifications for completion?
Because there are multiple factors that govern the size of the chunks in which
the refilling is done:
(1) We want to get user pages in batches to reduce the cost of the
synchronisation MM has to do. Further, the individual spans in the
batches will be of variable size (folios can be of different sizes, for
example). The idea of the 'refill' is that we go and refill as each
batch is transcribed into skbuffs.
(2) We don't want to run extraction too far ahead as that will delay the
onset of transmission.
(3) We don't want to pin too much at any one time as that builds memory
pressure and in the worst case will cause OOM conditions.
So we need to balance things - particularly (1) and (2) - and accept that we
may get multiple refills in order to fill the socket transmission buffer.
> > (5) The SO_EE_ORIGIN_ZEROCOPY completion notifications are then generated by
> > the cleanup function.
>
> Already the case? :)
This is more a note-to-self, but what I'm thinking of doing would have the
sendmsg() handler insert the SO_EE_ORIGIN_ZEROCOPY notification into the
socket receive queue.
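(For reference, userspace currently collects those notifications from
the error queue roughly like this - condensed from the example in
Documentation/networking/msg_zerocopy.rst; handle_completion() is a
made-up placeholder and error handling is omitted:)

        struct sock_extended_err *serr;
        struct cmsghdr *cm;
        char control[128];
        struct msghdr msg = {
                .msg_control    = control,
                .msg_controllen = sizeof(control),
        };

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
                return;                         /* nothing queued yet */

        cm = CMSG_FIRSTHDR(&msg);
        serr = (struct sock_extended_err *)CMSG_DATA(cm);
        if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                /* the completed range is [serr->ee_info, serr->ee_data] */
                handle_completion(serr->ee_info, serr->ee_data);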
David
* Re: Reorganising how the networking layer handles memory
From: Christoph Hellwig @ 2025-05-06 13:56 UTC (permalink / raw)
To: David Howells
Cc: Jakub Kicinski, Andrew Lunn, Eric Dumazet, David S. Miller,
David Hildenbrand, John Hubbard, Christoph Hellwig, willy, netdev,
linux-mm, Willem de Bruijn
On Tue, May 06, 2025 at 02:50:49PM +0100, David Howells wrote:
> > > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > > it doesn't use page pinning. It needs to use the GUP routines.
> >
> > We end up calling iov_iter_get_pages2(). Is not setting
> > FOLL_PIN a conscious choice, or has nobody cared until now?
>
> iov_iter_get_pages*() predates GUP, I think.
It predates pin_user_pages, but get_user_pages is much older.
> There's now an
> iov_iter_extract_pages() that does the pinning stuff, but you have to do a
> different cleanup, which is why I created a new API call.
But yes, iov_iter_get_pages* needs to go away in favour of
iov_iter_extract_pages, and I'm still annoyed that despite multiple
pings no one has done any work on that outside of block / block based
direct I/O and netfs.
> > > (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> > > used with certain memory types (e.g. slab). It takes a ref on whatever
> > > it is given - which is wrong if it should pin this instead.
> >
> > s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
> > a ref to the stack, right? But yes, the networking stack will try to
> > release it.
>
> I mean 'takes' as in skb_append_pagefrags() calls get_page() - something that
> needs to be changed.
>
> Christoph Hellwig would like to make it such that the extractor gets
> {phyaddr,len} rather than {page,off,len} - so all you, the network layer, see
> is that you've got a span of memory to use as your buffer. How that span of
> memory is managed is the responsibility of whoever called sendmsg() - and they
> need a callback to be able to handle that.
Not sure what the extractor is, but we plan to change the bio_vec
to be physical address based instead of page+offset based, where 'we' is
a lot more people than just me.
> One advantage of delegating it to the caller, though, and having the caller
> keep track of which bits it still needs to hold on to by transmission
> completion position is that we don't need to manage refs/pins across sk_buff
> duplication - let alone what we should do with stuff that's kmalloc'd.
And the callers already do that for all other kinds of I/O anyway.
* Re: Reorganising how the networking layer handles memory
From: Jakub Kicinski @ 2025-05-06 18:20 UTC (permalink / raw)
To: David Howells
Cc: Andrew Lunn, Eric Dumazet, David S. Miller, David Hildenbrand,
John Hubbard, Christoph Hellwig, willy, netdev, linux-mm,
Willem de Bruijn
On Tue, 06 May 2025 14:50:49 +0100 David Howells wrote:
> Jakub Kicinski <kuba@kernel.org> wrote:
> > > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > > it doesn't use page pinning. It needs to use the GUP routines.
> >
> > We end up calling iov_iter_get_pages2(). Is not setting
> > FOLL_PIN a conscious choice, or has nobody cared until now?
>
> iov_iter_get_pages*() predates GUP, I think. There's now an
> iov_iter_extract_pages() that does the pinning stuff, but you have to do a
> different cleanup, which is why I created a new API call.
>
> iov_iter_extract_pages() also does no pinning at all on pages extracted from a
> non-user iterator (e.g. ITER_BVEC).
FWIW it occurred to me after hitting send that we may not care.
We're talking about Tx, so the user pages are read only for the kernel.
I don't think we have the "child gets the read data" problem?
> > > (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> > > used with certain memory types (e.g. slab). It takes a ref on whatever
> > > it is given - which is wrong if it should pin this instead.
> >
> > s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
> > a ref to the stack, right? But yes, the networking stack will try to
> > release it.
>
> I mean 'takes' as in skb_append_pagefrags() calls get_page() - something that
> needs to be changed.
>
> Christoph Hellwig would like to make it such that the extractor gets
> {phyaddr,len} rather than {page,off,len} - so all you, the network layer, see
> is that you've got a span of memory to use as your buffer. How that span of
> memory is managed is the responsibility of whoever called sendmsg() - and they
> need a callback to be able to handle that.
Sure, there may be things to iron out as data in networking is not
opaque. We need to handle the firewalling and inspection cases.
Likely all this will work well for ZC but not sure if we can "convert"
the stack to phyaddr+len.
> > TAL at struct ubuf_info
>
> I've looked at it, yes, however, I'm wondering if we can make it more generic
> and usable by regular file DIO and splice also.
Okay, just keep in mind that we are working on 800Gbps NIC support these
days, and MTU does not grow. So whatever we do - it must be fast fast.
> Further, we need a way to track pages we've pinned. One way to do that is to
> simply rely on the sk_buff fragment array and keep track of which particular
> bits need putting/unpinning/freeing/kfreeing/etc - but really that should be
> handled by the caller unless it costs too much performance (which it might).
>
> One advantage of delegating it to the caller, though, and having the caller
> keep track of which bits it still needs to hold on to by transmission
> completion position is that we don't need to manage refs/pins across sk_buff
> duplication - let alone what we should do with stuff that's kmalloc'd.
>
> > > (3) We also pass an optional 'refill' function to sendmsg. As data is
> > > sent, the code that extracts the data will call this to pin more user
> > > bufs (we don't necessarily want to pin everything up front). The
> > > refill function is permitted to sleep to allow the amount of pinned
> > > memory to subside.
> >
> > Why not feed the data as you get the notifications for completion?
>
> Because there are multiple factors that govern the size of the chunks in which
> the refilling is done:
>
> (1) We want to get user pages in batches to reduce the cost of the
> synchronisation MM has to do. Further, the individual spans in the
> batches will be of variable size (folios can be of different sizes, for
> example). The idea of the 'refill' is that we go and refill as each
> batch is transcribed into skbuffs.
>
> (2) We don't want to run extraction too far ahead as that will delay the
> onset of transmission.
>
> (3) We don't want to pin too much at any one time as that builds memory
> pressure and in the worst case will cause OOM conditions.
>
> So we need to balance things - particularly (1) and (2) - and accept that we
> may get multiple refills in order to fill the socket transmission buffer.
Hard to comment without a concrete workload at hand.
Ideally the interface would be good enough for the application
to dependably drive the transmission in an efficient way.
* Re: Reorganising how the networking layer handles memory
From: David Howells @ 2025-05-07 13:45 UTC (permalink / raw)
To: Jakub Kicinski
Cc: dhowells, Andrew Lunn, Eric Dumazet, David S. Miller,
David Hildenbrand, John Hubbard, Christoph Hellwig, willy, netdev,
linux-mm, Willem de Bruijn
Jakub Kicinski <kuba@kernel.org> wrote:
> On Tue, 06 May 2025 14:50:49 +0100 David Howells wrote:
> > Jakub Kicinski <kuba@kernel.org> wrote:
> > > > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > > > it doesn't use page pinning. It needs to use the GUP routines.
> > >
> > > We end up calling iov_iter_get_pages2(). Is not setting
> > > FOLL_PIN a conscious choice, or has nobody cared until now?
> >
> > iov_iter_get_pages*() predates GUP, I think. There's now an
> > iov_iter_extract_pages() that does the pinning stuff, but you have to do a
> > different cleanup, which is why I created a new API call.
> >
> > iov_iter_extract_pages() also does no pinning at all on pages extracted from a
> > non-user iterator (e.g. ITER_BVEC).
>
> FWIW it occurred to me after hitting send that we may not care.
> We're talking about Tx, so the user pages are read only for the kernel.
> I don't think we have the "child gets the read data" problem?
Worse: if the child alters the data in the buffer to be transmitted after the
fork() (say it calls free() and malloc()), it can do so; if the parent tries
that, there will be no effect.
> Likely all this will work well for ZC but not sure if we can "convert"
> the stack to phyaddr+len.
Me neither. We also use bio_vec[] to hold lists of memory and then trawl them
to do cleanup, but a conversion to holding {phys,len} will mandate being able
to do some sort of reverse lookup.
> Okay, just keep in mind that we are working on 800Gbps NIC support these
> days, and MTU does not grow. So whatever we do - it must be fast fast.
Crazy:-)
One thing I've noticed in the uring stuff is that it doesn't seem to like the
idea of having an sk_buff pointing to more than one ubuf_info, presumably
because the sk_buff will point to the ubuf_info holding the zerocopyable data.
Is that actually necessary for SOCK_STREAM, though?
My thought for SOCK_STREAM is to have an ordered list of zerocopy source
records on the socket and a completion counter and not tag the skbuffs at all.
That way, an skbuff can carry data for multiple zerocopy send requests.
David
* Re: Reorganising how the networking layer handles memory
From: David Howells @ 2025-05-07 13:49 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dhowells, Jakub Kicinski, Andrew Lunn, Eric Dumazet,
David S. Miller, David Hildenbrand, John Hubbard, willy, netdev,
linux-mm, Willem de Bruijn
Christoph Hellwig <hch@infradead.org> wrote:
> > Christoph Hellwig would like to make it such that the extractor gets
> > {phyaddr,len} rather than {page,off,len} - so all you, the network layer,
> > see is that you've got a span of memory to use as your buffer. How that
> > span of memory is managed is the responsibility of whoever called
> > sendmsg() - and they need a callback to be able to handle that.
>
> Not sure what the extractor is
Just a function that tries to get information out of the iov_iter. In the
case of the networking layer, something like zerocopy_fill_skb_from_iter()
that calls iov_iter_get_pages2() currently.
David
* Re: Reorganising how the networking layer handles memory
From: Willem de Bruijn @ 2025-05-07 17:47 UTC (permalink / raw)
To: David Howells, Jakub Kicinski
Cc: dhowells, Andrew Lunn, Eric Dumazet, David S. Miller,
David Hildenbrand, John Hubbard, Christoph Hellwig, willy, netdev,
linux-mm, Willem de Bruijn
David Howells wrote:
> Jakub Kicinski <kuba@kernel.org> wrote:
>
> > On Tue, 06 May 2025 14:50:49 +0100 David Howells wrote:
> > > Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > > > > it doesn't use page pinning. It needs to use the GUP routines.
> > > >
> > > > We end up calling iov_iter_get_pages2(). Is not setting
> > > > FOLL_PIN a conscious choice, or has nobody cared until now?
> > >
> > > iov_iter_get_pages*() predates GUP, I think. There's now an
> > > iov_iter_extract_pages() that does the pinning stuff, but you have to do a
> > > different cleanup, which is why I created a new API call.
> > >
> > > iov_iter_extract_pages() also does no pinning at all on pages extracted from a
> > > non-user iterator (e.g. ITER_BVEC).
> >
> > FWIW it occurred to me after hitting send that we may not care.
> > We're talking about Tx, so the user pages are read only for the kernel.
> > I don't think we have the "child gets the read data" problem?
>
> Worse: if the child alters the data in the buffer to be transmitted after the
> fork() (say it calls free() and malloc()), it can do so; if the parent tries
> that, there will be no effect.
>
> > Likely all this will work well for ZC but not sure if we can "convert"
> > the stack to phyaddr+len.
>
> Me neither. We also use bio_vec[] to hold lists of memory and then trawl them
> to do cleanup, but a conversion to holding {phys,len} will mandate being able
> to do some sort of reverse lookup.
>
> > Okay, just keep in mind that we are working on 800Gbps NIC support these
> > days, and MTU does not grow. So whatever we do - it must be fast fast.
>
> Crazy:-)
>
> One thing I've noticed in the uring stuff is that it doesn't seem to like the
> idea of having an sk_buff pointing to more than one ubuf_info, presumably
> because the sk_buff will point to the ubuf_info holding the zerocopyable data.
> Is that actually necessary for SOCK_STREAM, though?
In MSG_ZEROCOPY this limitation of at most one ubuf_info per skb was
chosen just because it was simpler and sufficient.
A single skb can still combine skb frags from multiple consecutive
sendmsg requests, including multiple separate MSG_ZEROCOPY calls.
Because the ubuf_info notification is for a range of bytes.
There is a rare edge case in skb_zerocopy_iter_stream that detects
two ubuf_infos on a single skb.
        /* An skb can only point to one uarg. This edge case happens
         * when TCP appends to an skb, but zerocopy_realloc triggered
         * a new alloc.
         */
        if (orig_uarg && uarg != orig_uarg)
                return -EEXIST;
Instead, TCP then just creates a new skb.
This will result in smaller skbs than otherwise, but as said, it is rare.
> My thought for SOCK_STREAM is to have an ordered list of zerocopy source
> records on the socket and a completion counter and not tag the skbuffs at all.
> That way, an skbuff can carry data for multiple zerocopy send requests.
>
> David
>
* AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
From: David Howells @ 2025-05-12 14:51 UTC (permalink / raw)
To: Andrew Lunn
Cc: dhowells, Eric Dumazet, David S. Miller, Jakub Kicinski,
David Hildenbrand, John Hubbard, Christoph Hellwig, willy,
Christian Brauner, Al Viro, Miklos Szeredi, torvalds, netdev,
linux-mm, linux-fsdevel, linux-kernel
I'm looking at how to make sendmsg() handle page pinning - and also working
towards supporting the page refcount eventually being removed and only being
available with certain memory types.
One of the outstanding issues is in sendmsg(). Analogously with DIO writes,
sendmsg() should be pinning memory (FOLL_PIN/GUP) rather than simply getting
refs on it before it attaches it to an sk_buff. Without this, if memory is
spliced into an AF_UNIX socket and then the process forks, that memory gets
attached to the child process, and the child can alter the data, probably by
accident, if the memory is on the stack or in the heap.
Further, kernel services can use MSG_SPLICE_PAGES to attach memory directly to
an AF_UNIX pipe (though I'm not sure if anyone actually does this).
(For writing to TCP/UDP with MSG_ZEROCOPY, MSG_SPLICE_PAGES or vmsplice, I
think we're probably fine - assuming the loopback driver doesn't give the
receiver the transmitter's buffers to use directly... This may be a big
'if'.)
Now, this probably wouldn't be a problem, but for the fact that one can also
splice this stuff back *out* of the socket.
The same issues exist for pipes too.
The question is what should happen here to a memory span for which the network
layer or pipe driver is not allowed to take a reference, but rather must call a
destructor? Particularly if, say, it's just a small part of a larger span.
It seems reasonable that we should allow pinned memory spans to be queued in a
socket or a pipe - that way, we only have to copy the data once in the event
that the data is extracted with read(), recvmsg() or similar. But if it's
spliced out we then have all the fun of managing the lifetime - especially if
it's a big transfer that gets split into bits. In such a case, I wonder if we
can just duplicate the memory at splice-out rather than trying to keep all the
tracking intact.
If the memory was copied in, then moving the pages should be fine - though the
memory may not be of a ref'able type (which would be fun if bits of such a
page get spliced to different places).
I'm sure there is some app somewhere (fuse maybe?) where this would be a
performance problem, though.
And then there's vmsplice(). The same goes for vmsplice() to AF_UNIX or to a
pipe. That should also pin memory. It may also be possible to vmsplice a
pinned page into the target process's VM or a page from a memory span with
some other type of destruction. I don't suppose we can deprecate vmsplice()?
David
* Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
From: David Hildenbrand @ 2025-05-12 21:59 UTC (permalink / raw)
To: David Howells, Andrew Lunn
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, John Hubbard,
Christoph Hellwig, willy, Christian Brauner, Al Viro,
Miklos Szeredi, torvalds, netdev, linux-mm, linux-fsdevel,
linux-kernel
On 12.05.25 16:51, David Howells wrote:
> I'm looking at how to make sendmsg() handle page pinning - and also working
> towards supporting the page refcount eventually being removed and only being
> available with certain memory types.
>
> One of the outstanding issues is in sendmsg(). Analogously with DIO writes,
> sendmsg() should be pinning memory (FOLL_PIN/GUP) rather than simply getting
> refs on it before it attaches it to an sk_buff. Without this, if memory is
> spliced into an AF_UNIX socket and then the process forks, that memory gets
> attached to the child process, and the child can alter the data
That should not be possible. Neither the child nor the parent can modify
the page. Any write attempt will result in Copy-on-Write.
The issue is that if the parent writes to some unrelated part of the
page after fork() but before DIO completed, the parent will trigger
Copy-on-Write and the DIO will essentially be lost from the parent's POV
(goes to the wrong page).
> probably by
> accident, if the memory is on the stack or in the heap.
>
> Further, kernel services can use MSG_SPLICE_PAGES to attach memory directly to
> an AF_UNIX pipe (though I'm not sure if anyone actually does this).
>
> (For writing to TCP/UDP with MSG_ZEROCOPY, MSG_SPLICE_PAGES or vmsplice, I
> think we're probably fine - assuming the loopback driver doesn't give the
> receiver the transmitter's buffers to use directly... This may be a big
> 'if'.)
>
> Now, this probably wouldn't be a problem, but for the fact that one can also
> splice this stuff back *out* of the socket.
>
> The same issues exist for pipes too.
>
> The question is what should happen here to a memory span for which the network
> layer or pipe driver is not allowed to take reference, but rather must call a
> destructor? Particularly if, say, it's just a small part of a larger span.
>
> It seems reasonable that we should allow pinned memory spans to be queued in a
> socket or a pipe - that way, we only have to copy the data once in the event
> that the data is extracted with read(), recvmsg() or similar. But if it's
> spliced out we then have all the fun of managing the lifetime - especially if
> it's a big transfer that gets split into bits. In such a case, I wonder if we
> can just duplicate the memory at splice-out rather than trying to keep all the
> tracking intact.
>
> If the memory was copied in, then moving the pages should be fine - though the
> memory may not be of a ref'able type (which would be fun if bits of such a
> page get spliced to different places).
>
> I'm sure there is some app somewhere (fuse maybe?) where this would be a
> performance problem, though.
>
> And then there's vmsplice(). The same goes for vmsplice() to AF_UNIX or to a
> pipe. That should also pin memory. It may also be possible to vmsplice a
> pinned page into the target process's VM or a page from a memory span with
> some other type of destruction.
IIRC, vmsplice() never does that optimization for that direction (map
pinned page into the target process). It would be a mess.
But yes, vmsplice() should be using FOLL_PIN|FOLL_LONGTERM. Deprecation
is unlikely to happen, I'm afraid :(
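A minimal sketch of what that would mean on the vmsplice gather side
(real GUP API; 'uaddr', 'nr_pages' and 'pages' are placeholders):

        /* Long-term pin, since the pages can sit in a pipe indefinitely.
         * For the user->pipe direction the kernel only reads the pages,
         * so FOLL_WRITE isn't needed.
         */
        pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_LONGTERM, pages);
        if (pinned < 0)
                return pinned;

        /* ... wrap the pinned pages in pipe_buffers ... */

        /* ... and the pipe_buffer release op later drops the pins: */
        unpin_user_pages(pages, pinned);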
--
Cheers,
David / dhildenb
* How to handle P2P DMA with only {physaddr,len} in bio_vec?
From: David Howells @ 2025-06-23 10:50 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dhowells, Andrew Lunn, Eric Dumazet, David S. Miller,
Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry,
willy, Christian Brauner, Al Viro, netdev, linux-mm,
linux-fsdevel, linux-kernel
Hi Christoph,
Looking at the DMA address mapping infrastructure, it makes use of the page
struct to access the physical address (which obviously shouldn't be a problem)
and to find out if the page is involved in P2P DMA.
dma_direct_map_page() calls is_pci_p2pdma_page():
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
page_pgmap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
}
What's the best way to manage this without having to go back to the page
struct for every DMA mapping we want to make? Do we need to have
iov_extract_user_pages() note this in the bio_vec?
struct bio_vec {
physaddr_t bv_base_addr; /* 64-bits */
size_t bv_len:56; /* Maybe just u32 */
bool p2pdma:1; /* Region is involved in P2P */
unsigned int spare:7;
};
I'm guessing that only folio-type pages can be involved in this:
static inline struct dev_pagemap *page_pgmap(const struct page *page)
{
VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page);
return page_folio(page)->pgmap;
}
as only struct folio has a pointer to dev_pagemap? And I assume this is going
to get removed from struct page itself at some point soonish.
David
* Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
From: Christian Brauner @ 2025-06-23 11:50 UTC (permalink / raw)
To: David Howells
Cc: Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski,
David Hildenbrand, John Hubbard, Christoph Hellwig, willy,
Al Viro, Miklos Szeredi, torvalds, netdev, linux-mm,
linux-fsdevel, linux-kernel
On Mon, May 12, 2025 at 03:51:30PM +0100, David Howells wrote:
> I'm looking at how to make sendmsg() handle page pinning - and also working
> towards supporting the page refcount eventually being removed and only being
> available with certain memory types.
>
> One of the outstanding issues is in sendmsg(). Analogously with DIO writes,
> sendmsg() should be pinning memory (FOLL_PIN/GUP) rather than simply getting
> refs on it before it attaches it to an sk_buff. Without this, if memory is
> spliced into an AF_UNIX socket and then the process forks, that memory gets
> attached to the child process, and the child can alter the data, probably by
> accident, if the memory is on the stack or in the heap.
>
> Further, kernel services can use MSG_SPLICE_PAGES to attach memory directly to
> an AF_UNIX pipe (though I'm not sure if anyone actually does this).
I would possible be interested in using this for the coredump af_unix socket.
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec?
From: Christoph Hellwig @ 2025-06-23 13:46 UTC (permalink / raw)
To: David Howells
Cc: Christoph Hellwig, Andrew Lunn, Eric Dumazet, David S. Miller,
Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry,
willy, Christian Brauner, Al Viro, netdev, linux-mm,
linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe,
Jason Gunthorpe
Hi David,
On Mon, Jun 23, 2025 at 11:50:58AM +0100, David Howells wrote:
> What's the best way to manage this without having to go back to the page
> struct for every DMA mapping we want to make?
There isn't a very easy way, partly because if you actually need to do
peer-to-peer transfers, you right now absolutely need the page to find
the pgmap that has the information on how to perform the peer-to-peer
transfer.
> Do we need to have
> iov_extract_user_pages() note this in the bio_vec?
>
> struct bio_vec {
> physaddr_t bv_base_addr; /* 64-bits */
> size_t bv_len:56; /* Maybe just u32 */
> bool p2pdma:1; /* Region is involved in P2P */
> unsigned int spare:7;
> };
Having a flag in the bio_vec might be a way to shortcut the P2P or not
decision a bit. The downside is that without the flag, the bio_vec
in the brave new page-less world would actually just be:
struct bio_vec {
phys_addr_t bv_phys;
u32 bv_len;
} __packed;
i.e. adding any more information would actually increase the size from
12 bytes to 16 bytes for the usual 64-bit phys_addr_t setups, and thus
undo all the memory savings that this move would provide.
Note that, at least for the block layer, the DMA mapping changes I'm about
to send out again require each bio to be either non-P2P or P2P to a
specific device. It might be worth extending this higher-level
limitation to other users too, if feasible.
> I'm guessing that only folio-type pages can be involved in this:
>
> static inline struct dev_pagemap *page_pgmap(const struct page *page)
> {
> VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page);
> return page_folio(page)->pgmap;
> }
>
> as only struct folio has a pointer to dev_pagemap? And I assume this is going
> to get removed from struct page itself at some point soonish.
I guess so.
* Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
From: Christoph Hellwig @ 2025-06-23 13:53 UTC (permalink / raw)
To: David Howells
Cc: Andrew Lunn, Eric Dumazet, David S. Miller, Jakub Kicinski,
David Hildenbrand, John Hubbard, Christoph Hellwig, willy,
Christian Brauner, Al Viro, Miklos Szeredi, torvalds, netdev,
linux-mm, linux-fsdevel, linux-kernel
On Mon, May 12, 2025 at 03:51:30PM +0100, David Howells wrote:
> I'm looking at how to make sendmsg() handle page pinning - and also working
> towards supporting the page refcount eventually being removed and only being
> available with certain memory types.
Yes, that would be great.
> The question is what should happen here to a memory span for which the network
> layer or pipe driver is not allowed to take a reference, but rather must call a
> destructor? Particularly if, say, it's just a small part of a larger span.
What is a "span" in this context? In general splice unlike direct I/O
relies on page reference counts inside the splice machinery. But that is
configurable through the pipe_buf_operations. So if you want something
to be handled by splice that does not use simple page refcounts you need
special pipe_buf_operations for it. And you'd better have a really good
use case for this to be worthwhile.
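For reference, that hook currently looks roughly like this
(include/linux/pipe_fs_i.h, lightly annotated):

        struct pipe_buf_operations {
                /* Ensure the contents are valid before use. */
                int (*confirm)(struct pipe_inode_info *, struct pipe_buffer *);

                /* Drop whatever reference/pin the buffer holds - this is
                 * where a non-refcounted memory type would run its
                 * destructor.
                 */
                void (*release)(struct pipe_inode_info *, struct pipe_buffer *);

                /* Attempt to take ownership of the page (splice stealing). */
                bool (*try_steal)(struct pipe_inode_info *, struct pipe_buffer *);

                /* Take an extra reference for duplication, e.g. tee(). */
                bool (*get)(struct pipe_inode_info *, struct pipe_buffer *);
        };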
> And then there's vmsplice(). The same goes for vmsplice() to AF_UNIX or to a
> pipe. That should also pin memory. It may also be possible to vmsplice a
> pinned page into the target process's VM or a page from a memory span with
> some other type of destruction. I don't suppose we can deprecate vmsplice()?
You'll need a longterm pin for vmsplice. I'd love to deprecate it,
but I doubt it's going to go away any time soon if ever.
* Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
From: David Howells @ 2025-06-23 14:16 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dhowells, Andrew Lunn, Eric Dumazet, David S. Miller,
Jakub Kicinski, David Hildenbrand, John Hubbard, willy,
Christian Brauner, Al Viro, Miklos Szeredi, torvalds, netdev,
linux-mm, linux-fsdevel, linux-kernel
Christoph Hellwig <hch@infradead.org> wrote:
> > The question is what should happen here to a memory span for which the
> > network layer or pipe driver is not allowed to take reference, but rather
> > must call a destructor? Particularly if, say, it's just a small part of a
> > larger span.
>
> What is a "span" in this context?
In the first case, I was thinking along the lines of a bio_vec that says
{physaddr,len} defining a "span" of memory. Basically just a contiguous range
of physical addresses, if you prefer.
However, someone can, for example, vmsplice a span of memory into a pipe - say
they add a whole page, all nicely aligned, but then they splice it out a byte
at a time into 4096 other pipes. Each of those other pipes now has a small
part of a larger span and needs to share the cleanup information.
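Purely as an illustrative sketch (none of these names exist), the sharing
could be a refcounted cleanup record that every fragment points back at:

#include <linux/refcount.h>
#include <linux/types.h>

/* Hypothetical: one record per originally-pinned span; each pipe buffer
 * holding a fragment takes a ref and the last put runs the cleanup. */
struct span_cleanup {
	refcount_t	refs;
	phys_addr_t	start;
	size_t		len;
	void		(*destroy)(struct span_cleanup *sc); /* e.g. unpin the pages */
};

static void span_cleanup_put(struct span_cleanup *sc)
{
	if (refcount_dec_and_test(&sc->refs))
		sc->destroy(sc);
}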
Now, imagine that a network filesystem writes a message into a TCP socket,
where that message corresponds to an RPC call request and includes a number of
kernel buffers whose refcounts the network layer isn't permitted to use;
instead, a destructor must be called. The request message may transit through
the loopback driver and get placed on the Rx queue of another TCP socket -
from where it may be spliced off into a pipe.
Alternatively, if virtual I/O is involved, this message may get passed down to
a layer outside of the system (though I don't think this is, in principle, any
different from DMA being done by a NIC).
And then there's relayfs and fuse, which seem to do weird stuff.
For the splicing of a loop-backed kernel message out of a TCP socket, it might
make sense just to copy the message at that point. The problem is that the
kernel doesn't know what's going to happen to it next.
> In general splice unlike direct I/O relies on page reference counts inside
> the splice machinery. But that is configurable through the
> pipe_buf_operations. So if you want something to be handled by splice that
> does not use simple page refcounts you need special pipe_buf_operations for
> it. And you'd better have a really good use case for this to be worthwhile.
Yes. vmsplice is the equivalent of direct I/O and should really do the same
pinning that, say, a write() to an O_DIRECT file does.
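Roughly along these lines, as a sketch of the intent only (error handling
and the unpin-on-release side omitted):

#include <linux/mm.h>

/* Sketch: take a long-term pin on the user range, since the pages may sit
 * in a pipe indefinitely; buffer release would then unpin_user_pages(). */
static int vmsplice_pin_range(unsigned long uaddr, int nr_pages,
			      struct page **pages)
{
	return pin_user_pages_fast(uaddr, nr_pages, FOLL_LONGTERM, pages);
}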
David
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec?
2025-06-23 13:46 ` Christoph Hellwig
@ 2025-06-23 23:38 ` Alistair Popple
0 siblings, 0 replies; 24+ messages in thread
From: Alistair Popple @ 2025-06-23 23:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: David Howells, Andrew Lunn, Eric Dumazet, David S. Miller,
Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry,
willy, Christian Brauner, Al Viro, netdev, linux-mm,
linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe,
Jason Gunthorpe
On Mon, Jun 23, 2025 at 06:46:47AM -0700, Christoph Hellwig wrote:
> Hi David,
>
> On Mon, Jun 23, 2025 at 11:50:58AM +0100, David Howells wrote:
> > What's the best way to manage this without having to go back to the page
> > struct for every DMA mapping we want to make?
>
> There isn't a very easy way. Also because if you actually need to do
> peer to peer transfers, you right now absolutely need the page to find
> the pgmap that has the information on how to perform the peer to peer
> transfer.
>
> > Do we need to have
> > iov_extract_user_pages() note this in the bio_vec?
> >
> > 	struct bio_vec {
> > 		phys_addr_t	bv_base_addr;	/* 64-bits */
> > 		size_t		bv_len:56;	/* Maybe just u32 */
> > 		bool		p2pdma:1;	/* Region is involved in P2P */
> > 		unsigned int	spare:7;
> > 	};
>
> Having a flag in the bio_vec might be a way to shortcut the P2P or not
> decision a bit. The downside is that without the flag, the bio_vec
> in the brave new page-less world would actually just be:
>
> struct bio_vec {
> 	phys_addr_t	bv_phys;
> 	u32		bv_len;
> } __packed;
>
> i.e. adding any more information would actually increase the size from
> 12 bytes to 16 bytes for the usual 64-bit phys_addr_t setups, and thus
> undo all the memory savings that this move would provide.
>
> Note that at least for the block layer the DMA mapping changes I'm about
> to send out again require each bio to be either non P2P or P2P to a
> specific device. It might be worthwhile to also extend this higher-level
> limitation to other users, if feasible.
>
> > I'm guessing that only folio-type pages can be involved in this:
> >
> > static inline struct dev_pagemap *page_pgmap(const struct page *page)
> > {
> > 	VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page);
> > 	return page_folio(page)->pgmap;
> > }
> >
> > as only struct folio has a pointer to dev_pagemap? And I assume this is going
> > to get removed from struct page itself at some point soonish.
>
> I guess so.
It already has been, as the struct page field was renamed due to higher-order
folios needing the struct page dev_pgmap slot for compound_head. Obviously,
for order-0 folios the folio and page pgmap fields are in practice the same,
but I suppose that will change once struct page is shrunk.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec?
2025-06-23 10:50 ` How to handle P2P DMA with only {physaddr,len} in bio_vec? David Howells
2025-06-23 13:46 ` Christoph Hellwig
@ 2025-06-24 9:02 ` David Howells
2025-06-24 12:18 ` Jason Gunthorpe
2025-06-24 12:39 ` Christoph Hellwig
1 sibling, 2 replies; 24+ messages in thread
From: David Howells @ 2025-06-24 9:02 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dhowells, Andrew Lunn, Eric Dumazet, David S. Miller,
Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry,
willy, Christian Brauner, Al Viro, netdev, linux-mm,
linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe,
Jason Gunthorpe
Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Jun 23, 2025 at 11:50:58AM +0100, David Howells wrote:
> > What's the best way to manage this without having to go back to the page
> > struct for every DMA mapping we want to make?
>
> There isn't a very easy way. Also because if you actually need to do
> peer to peer transfers, you right now absolutely need the page to find
> the pgmap that has the information on how to perform the peer to peer
> transfer.
Are you expecting P2P to become particularly common? Page struct lookups are
going to become more expensive, because we'll have to do type checking and
Willy may eventually move them from a fixed array into a maple tree - so if we
can record the P2P flag in the bio_vec, it would help speed up the "not P2P"
case.
> > Do we need to have
> > iov_extract_user_pages() note this in the bio_vec?
> >
> > 	struct bio_vec {
> > 		phys_addr_t	bv_base_addr;	/* 64-bits */
> > 		size_t		bv_len:56;	/* Maybe just u32 */
> > 		bool		p2pdma:1;	/* Region is involved in P2P */
> > 		unsigned int	spare:7;
> > 	};
>
> Having a flag in the bio_vec might be a way to shortcut the P2P or not
> decision a bit. The downside is that without the flag, the bio_vec
> in the brave new page-less world would actually just be:
>
> struct bio_vec {
> 	phys_addr_t	bv_phys;
> 	u32		bv_len;
> } __packed;
>
> i.e. adding any more information would actually increase the size from
> 12 bytes to 16 bytes for the usualy 64-bit phys_addr_t setups, and thus
> undo all the memory savings that this move would provide.
Do we actually need 32 bits for bv_len, especially given that MAX_RW_COUNT is
capped at a bit less than 2GiB? Could we, say, do:
struct bio_vec {
	phys_addr_t	bv_phys;
	u32		bv_len:31;
	u32		bv_use_p2p:1;
} __packed;
And rather than storing the how-to-do-P2P info in the page struct, does it
make sense to hold it separately, keyed on bv_phys?
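For instance - purely hypothetical, the array name is invented - an XArray
keyed on the PFN, much as I believe the pgmap code already does for its own
lookups:

#include <linux/pfn.h>
#include <linux/xarray.h>

/* Invented for illustration: map PFN -> P2P source metadata so that the
 * bio_vec itself only needs to carry {phys,len}. */
static DEFINE_XARRAY(p2p_source_array);

static void *p2p_source_lookup(phys_addr_t phys)
{
	return xa_load(&p2p_source_array, PHYS_PFN(phys));
}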
Also, is it possible for the networking stack, say, to trivially map the P2P
memory in order to checksum it? I presume bv_phys in that case would point to
a mapping of device memory?
Thanks,
David
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec?
2025-06-24 9:02 ` David Howells
@ 2025-06-24 12:18 ` Jason Gunthorpe
2025-06-24 12:39 ` Christoph Hellwig
1 sibling, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2025-06-24 12:18 UTC (permalink / raw)
To: David Howells
Cc: Christoph Hellwig, Andrew Lunn, Eric Dumazet, David S. Miller,
Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry,
willy, Christian Brauner, Al Viro, netdev, linux-mm,
linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe
On Tue, Jun 24, 2025 at 10:02:05AM +0100, David Howells wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
>
> > On Mon, Jun 23, 2025 at 11:50:58AM +0100, David Howells wrote:
> > > What's the best way to manage this without having to go back to the page
> > > struct for every DMA mapping we want to make?
> >
> > There isn't a very easy way. Also because if you actually need to do
> > peer to peer transfers, you right now absolutely need the page to find
> > the pgmap that has the information on how to perform the peer to peer
> > transfer.
>
> Are you expecting P2P to become particularly common?
It is becoming commonplace in certain kinds of server systems. If half the
system's memory is behind PCI on a GPU or something, then you need P2P.
> Do we actually need 32 bits for bv_len, especially given that MAX_RW_COUNT is
> capped at a bit less than 2GiB? Could we, say, do:
>
> struct bio_vec {
> 	phys_addr_t	bv_phys;
> 	u32		bv_len:31;
> 	u32		bv_use_p2p:1;
> } __packed;
>
> And rather than storing the how-to-do-P2P info in the page struct, does it
> make sense to hold it separately, keyed on bv_phys?
I thought we had agreed these sorts of 'mixed transfers' were not desirable
and that we want things to be uniform at this lowest level.
So, I suggest the bio_vec should be entirely uniform: either it is all
CPU memory or it is all P2P from the same source. This is what the
block stack is doing by holding the P2P flag in the bio and splitting
the bios when they are constructed.
My intention for a more general, less performant API was to copy
what bio is doing and have a list of bio_vecs, each bio_vec having the
same properties.
The struct enclosing the bio_vec (the bio, etc.) would have the flag saying
whether it is P2P and some way to get the needed P2P source metadata.
The bio_vec itself would just store physical addresses and lengths. No
need for complicated bit slicing.
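Conceptually something like the following, with made-up names, just to show
where the properties would live:

#include <linux/bvec.h>
#include <linux/memremap.h>

/* Made-up container: uniformity is a property of the whole list, not of the
 * individual entries. */
struct phys_vec_batch {
	bool			p2p;		/* all entries P2P, or none */
	struct dev_pagemap	*pgmap;		/* P2P source metadata if p2p */
	unsigned int		nr_vecs;
	struct bio_vec		vecs[];		/* just {bv_phys, bv_len} */
};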
I think this is important because the new DMA API really doesn't want
to be changing modes on a per-item basis.
Jason
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How to handle P2P DMA with only {physaddr,len} in bio_vec?
2025-06-24 9:02 ` David Howells
2025-06-24 12:18 ` Jason Gunthorpe
@ 2025-06-24 12:39 ` Christoph Hellwig
1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2025-06-24 12:39 UTC (permalink / raw)
To: David Howells
Cc: Christoph Hellwig, Andrew Lunn, Eric Dumazet, David S. Miller,
Jakub Kicinski, David Hildenbrand, John Hubbard, Mina Almasry,
willy, Christian Brauner, Al Viro, netdev, linux-mm,
linux-fsdevel, linux-kernel, Leon Romanovsky, Logan Gunthorpe,
Jason Gunthorpe
On Tue, Jun 24, 2025 at 10:02:05AM +0100, David Howells wrote:
> > There isn't a very easy way. Also because if you actually need to do
> > peer to peer transfers, you right now absolutely need the page to find
> > the pgmap that has the information on how to perform the peer to peer
> > transfer.
>
> Are you expecting P2P to become particularly common?
What do you mean by 'particularly common'? In general it's a very niche
thing, but in certain niches it gets used more and more.
> Because page struct
> lookups will become more expensive because we'll have to do type checking and
> Willy may eventually move them from a fixed array into a maple tree - so if we
> can record the P2P flag in the bio_vec, it would help speed up the "not P2P"
> case.
As said before, the best place for that is a higher level structure than
the bio_vec.
> Do we actually need 32 bits for bv_len, especially given that MAX_RW_COUNT is
> capped at a bit less than 2GiB? Could we, say, do:
>
> struct bio_vec {
> 	phys_addr_t	bv_phys;
> 	u32		bv_len:31;
> 	u32		bv_use_p2p:1;
> } __packed;
I've already heard people complain that 32 bits might not be enough :)
> And rather than storing the how-to-do-P2P info in the page struct, does it
> make sense to hold it separately, keyed on bv_phys?
Maybe. But then you need to invent your own new refcounting for the
section representing the hot-pluggable P2P memory.
> Also, is it possible for the networking stack, say, to trivially map the P2P
> memory in order to checksum it? I presume bv_phys in that case would point to
> a mapping of device memory?
P2P is always to MMIO regions. So you can access it using the usual
MMIO helpers.
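I.e., as a rough sketch only (a real implementation would presumably map and
bounce in chunks rather than all at once):

#include <linux/io.h>
#include <net/checksum.h>

/* Sketch: map the MMIO span, bounce it into ordinary memory with
 * memcpy_fromio(), then checksum the bounce buffer as usual. */
static __sum16 csum_p2p_span(phys_addr_t phys, size_t len, void *bounce)
{
	void __iomem *va = ioremap(phys, len);
	__sum16 sum;

	if (!va)
		return 0;
	memcpy_fromio(bounce, va, len);
	sum = ip_compute_csum(bounce, len);
	iounmap(va);
	return sum;
}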
^ permalink raw reply [flat|nested] 24+ messages in thread
Thread overview: 24+ messages
2025-05-02 12:07 How much is checksumming done in the kernel vs on the NIC? David Howells
2025-05-02 13:09 ` Andrew Lunn
2025-05-02 13:41 ` MSG_ZEROCOPY and the O_DIRECT vs fork() race David Howells
2025-05-02 13:48 ` David Hildenbrand
2025-05-02 14:21 ` Andrew Lunn
2025-05-02 16:21 ` Reorganising how the networking layer handles memory David Howells
2025-05-05 20:14 ` Jakub Kicinski
2025-05-06 13:50 ` David Howells
2025-05-06 13:56 ` Christoph Hellwig
2025-05-06 18:20 ` Jakub Kicinski
2025-05-07 13:45 ` David Howells
2025-05-07 17:47 ` Willem de Bruijn
2025-05-07 13:49 ` David Howells
2025-05-12 14:51 ` AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN David Howells
2025-05-12 21:59 ` David Hildenbrand
2025-06-23 11:50 ` Christian Brauner
2025-06-23 13:53 ` Christoph Hellwig
2025-06-23 14:16 ` David Howells
2025-06-23 10:50 ` How to handle P2P DMA with only {physaddr,len} in bio_vec? David Howells
2025-06-23 13:46 ` Christoph Hellwig
2025-06-23 23:38 ` Alistair Popple
2025-06-24 9:02 ` David Howells
2025-06-24 12:18 ` Jason Gunthorpe
2025-06-24 12:39 ` Christoph Hellwig