From: Jakub Kicinski <kuba@kernel.org>
To: David Howells <dhowells@redhat.com>
Cc: Andrew Lunn <andrew@lunn.ch>, Eric Dumazet <edumazet@google.com>,
"David S. Miller" <davem@davemloft.net>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Christoph Hellwig <hch@infradead.org>,
willy@infradead.org, netdev@vger.kernel.org, linux-mm@kvack.org,
Willem de Bruijn <willemb@google.com>
Subject: Re: Reorganising how the networking layer handles memory
Date: Mon, 5 May 2025 13:14:46 -0700
Message-ID: <20250505131446.7448e9bf@kernel.org>
In-Reply-To: <1069540.1746202908@warthog.procyon.org.uk>

On Fri, 02 May 2025 17:21:48 +0100 David Howells wrote:
> Okay, perhaps I should start at the beginning :-).
Thanks :) Looks like Eric is CCed, also adding Willem.
The direction of using ubuf_info makes sense to me.
Random comments below on the little I know.
> There are a number of things that are going to mandate an overhaul of how the
> networking layer handles memory:
>
> (1) The sk_buff code assumes it can take refs on pages it is given, but the
> page ref counter is going to go away in the relatively near term.
>
> Indeed, you're already not allowed to take a ref on, say, slab memory,
> because the page ref doesn't control the lifetime of the object.
>
> Even pages are going to kind of go away. Willy haz planz...
I think the part NVMe folks run into is the BPF integration layer
called skmsg. It's supposed to be a BPF-based "data router" at the
socket layer, before any protocol processing, so it tries to do its
own page ref accounting.
> (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because it
> doesn't use page pinning. It needs to use the GUP routines.
We end up calling iov_iter_get_pages2(). Is not setting FOLL_PIN
a conscious choice, or has nobody cared until now?
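For reference, a minimal sketch of the two flavors -- not the actual
msg_zerocopy code, and note iov_iter_extract_pages() only pins when
the iterator is user-backed:

#include <linux/mm.h>
#include <linux/uio.h>

/* Ref-based extraction (what msg_zerocopy does today). */
static ssize_t grab_with_refs(struct iov_iter *iter, struct page **pages,
			      size_t maxsize, unsigned int maxpages,
			      size_t *start)
{
	/* Pages come back with refs (FOLL_GET); drop with put_page(). */
	return iov_iter_get_pages2(iter, pages, maxsize, maxpages, start);
}

/* Pin-based extraction. */
static ssize_t grab_with_pins(struct iov_iter *iter, struct page **pages,
			      size_t maxsize, unsigned int maxpages,
			      size_t *start)
{
	/* User-backed pages come back pinned (FOLL_PIN); drop with
	 * unpin_user_page().  Kernel-backed iterators aren't pinned;
	 * see iov_iter_extract_will_pin(). */
	return iov_iter_extract_pages(iter, &pages, maxsize, maxpages,
				      0, start);
}
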
> (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> used with certain memory types (e.g. slab). It takes a ref on whatever
> it is given - which is wrong if it should pin this instead.
s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
a ref to the stack, right? But yes, the networking stack will try to
release it.
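For concreteness, the usual in-kernel caller pattern looks something
like this (cf. nvme-tcp and ceph; error handling elided, and the pages
must be something the stack may legitimately hold, i.e. not slab):

#include <linux/bvec.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

static int send_page(struct socket *sock, struct page *page,
		     unsigned int offset, size_t len)
{
	struct bio_vec bvec;
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };

	/* The caller grants a ref; the stack drops it when the skb
	 * frag is freed. */
	bvec_set_page(&bvec, page, len, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
	return sock_sendmsg(sock, &msg);
}
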
> (4) iov_iter extraction will probably change to dispensing {physaddr,len}
> tuples rather than {page,off,len} tuples. The socket layer won't then
> see pages at all.
>
> (5) Memory segments splice()'d into a socket may have who-knows-what weird
> lifetime requirements.
>
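On (4), to make sure we mean the same thing -- a purely hypothetical
shape, not what include/linux/bvec.h has today:

#include <linux/types.h>

/* Hypothetical only: a bio_vec that no longer references pages,
 * just a physical range.  Illustrating the direction, nothing more. */
struct bio_vec {
	phys_addr_t	bv_phys;
	unsigned int	bv_len;
};
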
> So after discussions at LSF/MM, what I'm proposing is this:
>
> (1) If we want to use zerocopy we (the kernel) have to pass a cleanup
> function to sendmsg() along with the data. If you don't care about
> zerocopy, it will copy the data.
Take a look at struct ubuf_info.
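Roughly this shape in recent kernels (trimmed from
include/linux/skbuff.h; check the tree for the full thing):

#include <linux/refcount.h>
#include <linux/types.h>

struct sk_buff;
struct ubuf_info;

struct ubuf_info_ops {
	/* Called by the stack as it finishes with the user's buffers. */
	void (*complete)(struct sk_buff *skb, struct ubuf_info *uarg,
			 bool zerocopy_success);
	/* ... */
};

struct ubuf_info {
	const struct ubuf_info_ops *ops;
	refcount_t refcnt;
	u8 flags;
};
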
> (2) For each message sent with sendmsg, the cleanup function is called
> progressively as parts of the data it included are completed. I would do
> it progressively so that big messages can be handled.
>
> (3) We also pass an optional 'refill' function to sendmsg. As data is sent,
> the code that extracts the data will call this to pin more user bufs (we
> don't necessarily want to pin everything up front). The refill function
> is permitted to sleep to allow the amount of pinned memory to subside.
Why not feed the data as you get the notifications for completion?
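Also, to check I'm reading (1)-(3) right, a purely hypothetical
sketch -- none of these types or names exist today:

#include <linux/uio.h>

struct zc_buf_ops {
	/* Pin more of the source as the iterator drains; may sleep so
	 * the amount of pinned memory can subside. */
	ssize_t (*refill)(void *priv, struct iov_iter *to, size_t want);
	/* Called progressively as byte ranges of the message complete;
	 * unpins and generates the completion notification. */
	void (*cleanup)(void *priv, size_t offset, size_t len);
};
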
> (4) We move a lot of the zerocopy wrangling code out of the basement of the
> networking code and put it at the system call level, above the call to
> ->sendmsg() and the basement code then calls the appropriate functions to
> extract, refill and clean up. It may be usable in other contexts too -
> DIO to regular files, for example.
>
> (5) The SO_EE_ORIGIN_ZEROCOPY completion notifications are then generated by
> the cleanup function.
Already the case? :)
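Right -- userspace already gets these off the error queue, roughly
like so (per Documentation/networking/msg_zerocopy.rst):

#include <stdio.h>
#include <linux/errqueue.h>
#include <sys/socket.h>

/* Completed zerocopy sendmsg() calls are reported as an inclusive
 * [ee_info, ee_data] counter range on the socket's error queue. */
static void read_zc_completions(int fd)
{
	char control[100];
	struct msghdr msg = {
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err *ee =
			(struct sock_extended_err *)CMSG_DATA(cm);

		if (ee->ee_errno == 0 &&
		    ee->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
			printf("zerocopy done: %u..%u\n",
			       ee->ee_info, ee->ee_data);
	}
}
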
> (6) The sk_buff struct does not retain *any* refs/pins on memory fragments it
> refers to. This is done for it by the zerocopy layer.