netdev.vger.kernel.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: David Howells <dhowells@redhat.com>, Andrew Lunn <andrew@lunn.ch>
Cc: Eric Dumazet <edumazet@google.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Christoph Hellwig <hch@infradead.org>,
	willy@infradead.org, Christian Brauner <brauner@kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Miklos Szeredi <mszeredi@redhat.com>,
	torvalds@linux-foundation.org, netdev@vger.kernel.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
Date: Mon, 12 May 2025 23:59:24 +0200
Message-ID: <bb31c07a-0b70-4bca-9c59-42f6233791cd@redhat.com>
In-Reply-To: <2135907.1747061490@warthog.procyon.org.uk>

On 12.05.25 16:51, David Howells wrote:
> I'm looking at how to make sendmsg() handle page pinning - and also working
> towards supporting the page refcount eventually being removed and only being
> available with certain memory types.
> 
> One of the outstanding issues is in sendmsg().  Analogously with DIO writes,
> sendmsg() should be pinning memory (FOLL_PIN/GUP) rather than simply getting
> refs on it before it attaches it to an sk_buff.  Without this, if memory is
> spliced into an AF_UNIX socket and then the process forks, that memory gets
> attached to the child process, and the child can alter the data

That should not be possible. Neither the child nor the parent can modify 
the page. Any write attempt will result in Copy-on-Write.

The issue is that if the parent writes to some unrelated part of the 
page after fork() but before the DIO has completed, the parent will 
trigger Copy-on-Write and the DIO will essentially be lost from the 
parent's POV (it goes to the wrong page).


> probably by
> accident, if the memory is on the stack or in the heap.
> 
> Further, kernel services can use MSG_SPLICE_PAGES to attach memory directly to
> an AF_UNIX pipe (though I'm not sure if anyone actually does this).
> 
> (For writing to TCP/UDP with MSG_ZEROCOPY, MSG_SPLICE_PAGES or vmsplice, I
> think we're probably fine - assuming the loopback driver doesn't give the
> receiver the transmitter's buffers to use directly...  This may be a big
> 'if'.)
> 
> Now, this probably wouldn't be a problem, but for the fact that one can also
> splice this stuff back *out* of the socket.
> 
> The same issues exist for pipes too.
> 
> The question is what should happen here to a memory span for which the network
> layer or pipe driver is not allowed to take reference, but rather must call a
> destructor?  Particularly if, say, it's just a small part of a larger span.
> 
> It seems reasonable that we should allow pinned memory spans to be queued in a
> socket or a pipe - that way, we only have to copy the data once in the event
> that the data is extracted with read(), recvmsg() or similar.  But if it's
> spliced out we then have all the fun of managing the lifetime - especially if
> it's a big transfer that gets split into bits.  In such a case, I wonder if we
> can just duplicate the memory at splice-out rather than trying to keep all the
> tracking intact.
> 
> If the memory was copied in, then moving the pages should be fine - though the
> memory may not be of a ref'able type (which would be fun if bits of such a
> page get spliced to different places).
> 
> I'm sure there is some app somewhere (fuse maybe?) where this would be a
> performance problem, though.
> 
> And then there's vmsplice().  The same goes for vmsplice() to AF_UNIX or to a
> pipe.  That should also pin memory.  It may also be possible to vmsplice a
> pinned page into the target process's VM or a page from a memory span with
> some other type of destruction.

IIRC, vmsplice() never does that optimization for that direction (map 
pinned page into the target process). It would be a mess.

But yes, vmsplice() should be using FOLL_PIN|FOLL_LONGTERM. Deprecation 
is unlikely to happen, I'm afraid :(

-- 
Cheers,

David / dhildenb

