From: Jakub Kicinski <kuba@kernel.org>
To: David Howells <dhowells@redhat.com>
Cc: Andrew Lunn <andrew@lunn.ch>, Eric Dumazet <edumazet@google.com>,
"David S. Miller" <davem@davemloft.net>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Christoph Hellwig <hch@infradead.org>,
willy@infradead.org, netdev@vger.kernel.org, linux-mm@kvack.org,
Willem de Bruijn <willemb@google.com>
Subject: Re: Reorganising how the networking layer handles memory
Date: Mon, 5 May 2025 13:14:46 -0700
Message-ID: <20250505131446.7448e9bf@kernel.org>
In-Reply-To: <1069540.1746202908@warthog.procyon.org.uk>
On Fri, 02 May 2025 17:21:48 +0100 David Howells wrote:
> Okay, perhaps I should start at the beginning :-).
Thanks :) Looks like Eric is CCed, also adding Willem.
The direction of using ubuf_info makes sense to me.
Random comments below on the little I know.
> There are a number of things that are going to mandate an overhaul of how the
> networking layer handles memory:
>
> (1) The sk_buff code assumes it can take refs on pages it is given, but the
> page ref counter is going to go away in the relatively near term.
>
> Indeed, you're already not allowed to take a ref on, say, slab memory,
> because the page ref doesn't control the lifetime of the object.
>
> Even pages are going to kind of go away. Willy haz planz...
I think the part NVMe folks run into is the BPF integration layer
called skmsg. It's supposed to be a BPF-based "data router", at
the socket layer, before any protocol processing, so it tries
to do its own page ref accounting.
> (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because it
> doesn't use page pinning. It needs to use the GUP routines.
We end up calling iov_iter_get_pages2(). Is not setting FOLL_PIN
there a conscious choice, or has nobody cared until now?
> (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> used with certain memory types (e.g. slab). It takes a ref on whatever
> it is given - which is wrong if it should pin this instead.
s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
a ref to the stack, right? But yes, the networking stack will try to
release it.
> (4) iov_iter extraction will probably change to dispensing {physaddr,len}
> tuples rather than {page,off,len} tuples. The socket layer won't then
> see pages at all.
>
> (5) Memory segments splice()'d into a socket may have who-knows-what weird
> lifetime requirements.
>
> So after discussions at LSF/MM, what I'm proposing is this:
>
> (1) If we want to use zerocopy we (the kernel) have to pass a cleanup
> function to sendmsg() along with the data. If you don't care about
> zerocopy, it will copy the data.
Take a look at struct ubuf_info.
> (2) For each message sent with sendmsg, the cleanup function is called
> progressively as parts of the data it included are completed. I would do
> it progressively so that big messages can be handled.
>
> (3) We also pass an optional 'refill' function to sendmsg. As data is sent,
> the code that extracts the data will call this to pin more user bufs (we
> don't necessarily want to pin everything up front). The refill function
> is permitted to sleep to allow the amount of pinned memory to subside.
Why not feed the data as you get the notifications for completion?
> (4) We move a lot of the zerocopy wrangling code out of the basement of the
> networking code and put it at the system call level, above the call to
> ->sendmsg() and the basement code then calls the appropriate functions to
> extract, refill and clean up. It may be usable in other contexts too -
> DIO to regular files, for example.
>
> (5) The SO_EE_ORIGIN_ZEROCOPY completion notifications are then generated by
> the cleanup function.
Already the case? :)
> (6) The sk_buff struct does not retain *any* refs/pins on memory fragments it
> refers to. This is done for it by the zerocopy layer.
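To make the "cleanup function plus no refs held by the sk_buff" idea a bit
more concrete, here is a minimal userspace sketch. It is only a toy model
under stated assumptions: struct zc_buf, struct frag and all the helper
names are hypothetical, and none of this is the kernel's struct ubuf_info
or its API. It just shows fragments pointing at a shared owner descriptor,
with the owner's cleanup callback firing once the last fragment completes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a zerocopy owner descriptor (not struct ubuf_info). */
struct zc_buf {
	unsigned int refs;                   /* sender's ref + one per in-flight fragment */
	size_t completed;                    /* bytes whose transmission has finished */
	int notified;                        /* set by the cleanup callback in this demo */
	void (*complete)(struct zc_buf *zb); /* owner-supplied cleanup function */
};

/* A fragment holds no page ref of its own, only a pointer to the owner. */
struct frag {
	struct zc_buf *owner;
	size_t len;
};

static void zc_buf_init(struct zc_buf *zb, void (*complete)(struct zc_buf *))
{
	zb->refs = 1;		/* held by the sender until it has queued everything */
	zb->completed = 0;
	zb->notified = 0;
	zb->complete = complete;
}

/* Drop one reference; run the cleanup callback when the last one goes. */
static void zc_buf_put(struct zc_buf *zb)
{
	assert(zb->refs > 0);
	if (--zb->refs == 0)
		zb->complete(zb);
}

/* Attach a fragment of len bytes to the owner descriptor. */
static struct frag frag_attach(struct zc_buf *zb, size_t len)
{
	struct frag f = { .owner = zb, .len = len };

	zb->refs++;
	return f;
}

/* Called when the (imaginary) device is done with this fragment. */
static void frag_done(struct frag *f)
{
	f->owner->completed += f->len;
	zc_buf_put(f->owner);
}

/* Example cleanup: a real one would unpin memory and queue a notification. */
static void mark_notified(struct zc_buf *zb)
{
	zb->notified = 1;
}
```

A caller would zc_buf_init() the descriptor with its cleanup function,
frag_attach() each chunk it queues, drop its own reference with
zc_buf_put(), and then see the cleanup run only after frag_done() has
been called for every fragment. Progressive cleanup, as proposed in (2),
would need the callback to fire per completed range rather than once at
the end; this sketch deliberately does not model that.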