From: Jakub Kicinski <kuba@kernel.org>
To: David Howells <dhowells@redhat.com>
Cc: Andrew Lunn <andrew@lunn.ch>, Eric Dumazet <edumazet@google.com>,
"David S. Miller" <davem@davemloft.net>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Christoph Hellwig <hch@infradead.org>,
willy@infradead.org, netdev@vger.kernel.org, linux-mm@kvack.org,
Willem de Bruijn <willemb@google.com>
Subject: Re: Reorganising how the networking layer handles memory
Date: Tue, 6 May 2025 11:20:12 -0700 [thread overview]
Message-ID: <20250506112012.5779d652@kernel.org> (raw)
In-Reply-To: <1216273.1746539449@warthog.procyon.org.uk>
On Tue, 06 May 2025 14:50:49 +0100 David Howells wrote:
> Jakub Kicinski <kuba@kernel.org> wrote:
> > > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > > it doesn't use page pinning. It needs to use the GUP routines.
> >
> > We end up calling iov_iter_get_pages2(). Is it not setting
> > FOLL_PIN is a conscious choice, or nobody cared until now?
>
> iov_iter_get_pages*() predates GUP, I think. There's now an
> iov_iter_extract_pages() that does the pinning stuff, but you have to do a
> different cleanup, which is why I created a new API call.
>
> iov_iter_extract_pages() also does no pinning at all on pages extracted from a
> non-user iterator (e.g. ITER_BVEC).
FWIW it occurred to me after hitting send that we may not care.
We're talking about Tx, so the user pages are read only for the kernel.
I don't think we have the "child gets the read data" problem?
> > > (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> > > used with certain memory types (e.g. slab). It takes a ref on whatever
> > > it is given - which is wrong if it should pin this instead.
> >
> > s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
> > a ref to the stack, right? But yes, the networking stack will try to
> > release it.
>
> I mean 'takes' as in skb_append_pagefrags() calls get_page() - something that
> needs to be changed.
>
> Christoph Hellwig would like to make it such that the extractor gets
> {phyaddr,len} rather than {page,off,len} - so all you, the network layer, see
> is that you've got a span of memory to use as your buffer. How that span of
> memory is managed is the responsibility of whoever called sendmsg() - and they
> need a callback to be able to handle that.
Sure, there may be things to iron out as data in networking is not
opaque. We need to handle the firewalling and inspection cases.
Likely all this will work well for ZC but not sure if we can "convert"
the stack to phyaddr+len.
> > TAL at struct ubuf_info
>
> I've looked at it, yes, however, I'm wondering if we can make it more generic
> and usable by regular file DIO and splice also.
Okay, just keep in mind that we are working on 800Gbps NIC support these
days, and MTU does not grow. So whatever we do - it must be fast fast.
> Further, we need a way to track pages we've pinned. One way to do that is to
> simply rely on the sk_buff fragment array and keep track of which particular
> bits need putting/unpinning/freeing/kfreeing/etc - but really that should be
> handled by the caller unless it costs too much performance (which it might).
>
> Once advantage of delegating it to the caller, though, and having the caller
> keep track of which bits in still needs to hold on to by transmission
> completion position is that we don't need to manage refs/pins across sk_buff
> duplication - let alone what we should do with stuff that's kmalloc'd.
>
> > > (3) We also pass an optional 'refill' function to sendmsg. As data is
> > > sent, the code that extracts the data will call this to pin more user
> > > bufs (we don't necessarily want to pin everything up front). The
> > > refill function is permitted to sleep to allow the amount of pinned
> > > memory to subside.
> >
> > Why not feed the data as you get the notifications for completion?
>
> Because there are multiple factors that govern the size of the chunks in which
> the refilling is done:
>
> (1) We want to get user pages in batches to reduce the cost of the
> synchronisation MM has to do. Further, the individual spans in the
> batches will be of variable size (folios can be of different sizes, for
> example). The idea of the 'refill' is that we go and refill as each
> batch is transcribed into skbuffs.
>
> (2) We don't want to run extraction too far ahead as that will delay the
> onset of transmission.
>
> (3) We don't want to pin too much at any one time as that builds memory
> pressure and in the worst case will cause OOM conditions.
>
> So we need to balance things - particularly (1) and (2) - and accept that we
> may get multiple refils in order to fill the socket transmission buffer.
Hard to comment without concrete workload at hand.
Ideally the interface would be good enough for the application
to dependably drive the transmission in an efficient way.
next prev parent reply other threads:[~2025-05-06 18:20 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <0aa1b4a2-47b2-40a4-ae14-ce2dd457a1f7@lunn.ch>
[not found] ` <1015189.1746187621@warthog.procyon.org.uk>
2025-05-02 13:41 ` MSG_ZEROCOPY and the O_DIRECT vs fork() race David Howells
2025-05-02 13:48 ` David Hildenbrand
2025-05-02 14:21 ` Andrew Lunn
2025-05-02 16:21 ` Reorganising how the networking layer handles memory David Howells
2025-05-05 20:14 ` Jakub Kicinski
2025-05-06 13:50 ` David Howells
2025-05-06 13:56 ` Christoph Hellwig
2025-05-06 18:20 ` Jakub Kicinski [this message]
2025-05-07 13:45 ` David Howells
2025-05-07 17:47 ` Willem de Bruijn
2025-05-07 13:49 ` David Howells
2025-05-12 14:51 ` AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN David Howells
2025-05-12 21:59 ` David Hildenbrand
2025-06-23 11:50 ` Christian Brauner
2025-06-23 13:53 ` Christoph Hellwig
2025-06-23 14:16 ` David Howells
2025-06-23 10:50 ` How to handle P2P DMA with only {physaddr,len} in bio_vec? David Howells
2025-06-23 13:46 ` Christoph Hellwig
2025-06-23 23:38 ` Alistair Popple
2025-06-24 9:02 ` David Howells
2025-06-24 12:18 ` Jason Gunthorpe
2025-06-24 12:39 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250506112012.5779d652@kernel.org \
--to=kuba@kernel.org \
--cc=andrew@lunn.ch \
--cc=davem@davemloft.net \
--cc=david@redhat.com \
--cc=dhowells@redhat.com \
--cc=edumazet@google.com \
--cc=hch@infradead.org \
--cc=jhubbard@nvidia.com \
--cc=linux-mm@kvack.org \
--cc=netdev@vger.kernel.org \
--cc=willemb@google.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).