netdev.vger.kernel.org archive mirror
From: Jakub Kicinski <kuba@kernel.org>
To: David Howells <dhowells@redhat.com>
Cc: Andrew Lunn <andrew@lunn.ch>, Eric Dumazet <edumazet@google.com>,
	"David S. Miller" <davem@davemloft.net>,
	David Hildenbrand <david@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Christoph Hellwig <hch@infradead.org>,
	willy@infradead.org, netdev@vger.kernel.org, linux-mm@kvack.org,
	Willem de Bruijn <willemb@google.com>
Subject: Re: Reorganising how the networking layer handles memory
Date: Mon, 5 May 2025 13:14:46 -0700
Message-ID: <20250505131446.7448e9bf@kernel.org>
In-Reply-To: <1069540.1746202908@warthog.procyon.org.uk>

On Fri, 02 May 2025 17:21:48 +0100 David Howells wrote:
> Okay, perhaps I should start at the beginning :-).

Thanks :) Looks like Eric is CCed, also adding Willem.
The direction of using ubuf_info makes sense to me.
Random comments below on the little I know.

> There are a number of things that are going to mandate an overhaul of how the
> networking layer handles memory:
> 
>  (1) The sk_buff code assumes it can take refs on pages it is given, but the
>      page ref counter is going to go away in the relatively near term.
> 
>      Indeed, you're already not allowed to take a ref on, say, slab memory,
>      because the page ref doesn't control the lifetime of the object.
> 
>      Even pages are going to kind of go away.  Willy haz planz...

I think the part NVMe folks run into is the BPF integration layer
called skmsg. It's supposed to be a BPF-based "data router", at
the socket layer, before any protocol processing, so it tries
to do its own page ref accounting.

>  (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because it
>      doesn't use page pinning.  It needs to use the GUP routines.

We end up calling iov_iter_get_pages2(). Is not setting
FOLL_PIN a conscious choice, or did nobody care until now?

>  (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
>      used with certain memory types (e.g. slab).  It takes a ref on whatever
>      it is given - which is wrong if it should pin this instead.

s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
a ref to the stack, right? But yes, the networking stack will try to
release it.

>  (4) iov_iter extraction will probably change to dispensing {physaddr,len}
>      tuples rather than {page,off,len} tuples.  The socket layer won't then
>      see pages at all.
>
>  (5) Memory segments splice()'d into a socket may have who-knows-what weird
>      lifetime requirements.
> 
> So after discussions at LSF/MM, what I'm proposing is this:
> 
>  (1) If we want to use zerocopy we (the kernel) have to pass a cleanup
>      function to sendmsg() along with the data.  If you don't care about
>      zerocopy, it will copy the data.

Take a look at struct ubuf_info.
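
To make the ubuf_info pointer concrete, here is a minimal userspace sketch
of the refcount-plus-callback idea it embodies. It is a model for
illustration only, not the kernel definition: struct zc_buf, zc_buf_init(),
zc_buf_get(), zc_buf_put() and count_completion() are all hypothetical
names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; NOT the kernel's struct ubuf_info. */
struct zc_buf;
typedef void (*zc_complete_fn)(struct zc_buf *zb, bool zerocopy_success);

struct zc_buf {
	zc_complete_fn complete; /* cleanup function supplied by the data's owner */
	unsigned int refcnt;     /* one reference per in-flight user of the memory */
	void *priv;              /* owner's private cookie */
};

static void zc_buf_init(struct zc_buf *zb, zc_complete_fn fn, void *priv)
{
	zb->complete = fn;
	zb->refcnt = 1; /* the initial reference belongs to the submitter */
	zb->priv = priv;
}

/* Take an extra reference, e.g. when a second packet refers to the data. */
static void zc_buf_get(struct zc_buf *zb)
{
	assert(zb->refcnt > 0);
	zb->refcnt++;
}

/* Drop one reference; the owner's cleanup runs once, on the last put. */
static void zc_buf_put(struct zc_buf *zb, bool zerocopy_success)
{
	assert(zb->refcnt > 0);
	if (--zb->refcnt == 0)
		zb->complete(zb, zerocopy_success);
}

/* Example owner callback: count how many times cleanup ran. */
static void count_completion(struct zc_buf *zb, bool zerocopy_success)
{
	(void)zerocopy_success;
	(*(int *)zb->priv)++;
}
```

The point of the shape is that the packet side only ever does get/put on
the handle, while unpinning or unref'ing the memory stays in the owner's
callback.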

>  (2) For each message sent with sendmsg, the cleanup function is called
>      progressively as parts of the data it included are completed.  I would do
>      it progressively so that big messages can be handled.
> 
>  (3) We also pass an optional 'refill' function to sendmsg.  As data is sent,
>      the code that extracts the data will call this to pin more user bufs (we
>      don't necessarily want to pin everything up front).  The refill function
>      is permitted to sleep to allow the amount of pinned memory to subside.

Why not feed the data as you get the notifications for completion?
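
Roughly what I mean, as a userspace sketch with plain byte counters
standing in for pinned memory. Everything here is hypothetical (struct
zc_window, zc_refill(), zc_complete(), the max_pinned cap); it only shows
the completion handler topping the window back up, so nothing has to sleep
in a separate refill callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one large zerocopy message. */
struct zc_window {
	size_t total;      /* bytes in the whole message */
	size_t pinned_end; /* bytes [0, pinned_end) have been pinned so far */
	size_t done;       /* bytes [0, done) have completed and been unpinned */
	size_t max_pinned; /* cap on bytes pinned at any one time */
};

static size_t zc_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Pin as much as the cap allows; returns the number of bytes newly pinned. */
static size_t zc_refill(struct zc_window *w)
{
	size_t in_flight = w->pinned_end - w->done;
	size_t room = w->max_pinned - in_flight;
	size_t n = zc_min(room, w->total - w->pinned_end);

	w->pinned_end += n;
	return n;
}

/* Completion of len bytes: account them as unpinned, then refill at once. */
static size_t zc_complete(struct zc_window *w, size_t len)
{
	assert(len <= w->pinned_end - w->done);
	w->done += len;
	return zc_refill(w);
}
```

With total = 10000 and max_pinned = 4096, the first zc_refill() pins 4096
bytes and each later completion pins exactly as much as it released, until
the tail of the message is reached.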

>  (4) We move a lot of the zerocopy wrangling code out of the basement of the
>      networking code and put it at the system call level, above the call to
>      ->sendmsg() and the basement code then calls the appropriate functions to
>      extract, refill and clean up.  It may be usable in other contexts too -
>      DIO to regular files, for example.
> 
>  (5) The SO_EE_ORIGIN_ZEROCOPY completion notifications are then generated by
>      the cleanup function.

Already the case? :)

>  (6) The sk_buff struct does not retain *any* refs/pins on memory fragments it
>      refers to.  This is done for it by the zerocopy layer.

Thread overview: 24+ messages
2025-05-02 12:07 How much is checksumming done in the kernel vs on the NIC? David Howells
2025-05-02 13:09 ` Andrew Lunn
2025-05-02 13:41 ` MSG_ZEROCOPY and the O_DIRECT vs fork() race David Howells
2025-05-02 13:48   ` David Hildenbrand
2025-05-02 14:21   ` Andrew Lunn
2025-05-02 16:21   ` Reorganising how the networking layer handles memory David Howells
2025-05-05 20:14     ` Jakub Kicinski [this message]
2025-05-06 13:50     ` David Howells
2025-05-06 13:56       ` Christoph Hellwig
2025-05-06 18:20       ` Jakub Kicinski
2025-05-07 13:45       ` David Howells
2025-05-07 17:47         ` Willem de Bruijn
2025-05-07 13:49       ` David Howells
2025-05-12 14:51   ` AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN David Howells
2025-05-12 21:59     ` David Hildenbrand
2025-06-23 11:50     ` Christian Brauner
2025-06-23 13:53     ` Christoph Hellwig
2025-06-23 14:16     ` David Howells
2025-06-23 10:50   ` How to handle P2P DMA with only {physaddr,len} in bio_vec? David Howells
2025-06-23 13:46     ` Christoph Hellwig
2025-06-23 23:38       ` Alistair Popple
2025-06-24  9:02     ` David Howells
2025-06-24 12:18       ` Jason Gunthorpe
2025-06-24 12:39       ` Christoph Hellwig
