All of lore.kernel.org
 help / color / mirror / Atom feed
From: Christoph Hellwig <hch@infradead.org>
To: David Howells <dhowells@redhat.com>
Cc: Jakub Kicinski <kuba@kernel.org>, Andrew Lunn <andrew@lunn.ch>,
	Eric Dumazet <edumazet@google.com>,
	"David S. Miller" <davem@davemloft.net>,
	David Hildenbrand <david@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Christoph Hellwig <hch@infradead.org>,
	willy@infradead.org, netdev@vger.kernel.org, linux-mm@kvack.org,
	Willem de Bruijn <willemb@google.com>
Subject: Re: Reorganising how the networking layer handles memory
Date: Tue, 6 May 2025 06:56:17 -0700	[thread overview]
Message-ID: <aBoVAd-XX_44RKbC@infradead.org> (raw)
In-Reply-To: <1216273.1746539449@warthog.procyon.org.uk>

On Tue, May 06, 2025 at 02:50:49PM +0100, David Howells wrote:
> > > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > >      it doesn't use page pinning.  It needs to use the GUP routines.
> > 
> > We end up calling iov_iter_get_pages2(). Is it not setting
> > FOLL_PIN is a conscious choice, or nobody cared until now?
> 
> iov_iter_get_pages*() predates GUP, I think.

It predates pin_user_pages, but get_user_pages is much older.

> There's now an
> iov_iter_extract_pages() that does the pinning stuff, but you have to do a
> different cleanup, which is why I created a new API call.

But yes, iov_iter_get_pages* needs to go away in favour of
iov_iter_extract_pages, and I'm still annoyed that despite multiple
pings no one has done any work on that outside of block / block based
direct I/O and netfs.

> > >  (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> > >      used with certain memory types (e.g. slab).  It takes a ref on whatever
> > >      it is given - which is wrong if it should pin this instead.
> > 
> > s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants 
> > a ref  to the stack, right? But yes, the networking stack will try to
> > release it.
> 
> I mean 'takes' as in skb_append_pagefrags() calls get_page() - something that
> needs to be changed.
> 
> Christoph Hellwig would like to make it such that the extractor gets
> {phyaddr,len} rather than {page,off,len} - so all you, the network layer, see
> is that you've got a span of memory to use as your buffer.  How that span of
> memory is managed is the responsibility of whoever called sendmsg() - and they
> need a callback to be able to handle that.

Not sure what the extractor is, but we plan to change the bio_vec
to be physical address instead of page+offset based.  Where we is
a lot more people than just me.

> Once advantage of delegating it to the caller, though, and having the caller
> keep track of which bits in still needs to hold on to by transmission
> completion position is that we don't need to manage refs/pins across sk_buff
> duplication - let alone what we should do with stuff that's kmalloc'd.

And the callers already do that for all other kinds of I/O anyway.


  reply	other threads:[~2025-05-06 13:56 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-02 12:07 How much is checksumming done in the kernel vs on the NIC? David Howells
2025-05-02 13:09 ` Andrew Lunn
2025-05-02 13:41   ` MSG_ZEROCOPY and the O_DIRECT vs fork() race David Howells
2025-05-02 13:48     ` David Hildenbrand
2025-05-02 14:21     ` Andrew Lunn
2025-05-02 16:21       ` Reorganising how the networking layer handles memory David Howells
2025-05-05 20:14         ` Jakub Kicinski
2025-05-06 13:50           ` David Howells
2025-05-06 13:56             ` Christoph Hellwig [this message]
2025-05-07 13:49               ` David Howells
2025-05-06 18:20             ` Jakub Kicinski
2025-05-07 13:45               ` David Howells
2025-05-07 17:47                 ` Willem de Bruijn
2025-05-12 14:51         ` AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN David Howells
2025-05-12 21:59           ` David Hildenbrand
2025-06-23 10:50           ` How to handle P2P DMA with only {physaddr,len} in bio_vec? David Howells
2025-06-23 13:46             ` Christoph Hellwig
2025-06-23 23:38               ` Alistair Popple
2025-06-24  9:02               ` David Howells
2025-06-24 12:18                 ` Jason Gunthorpe
2025-06-24 12:39                 ` Christoph Hellwig
2025-06-23 11:50           ` AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN Christian Brauner
2025-06-23 13:53           ` Christoph Hellwig
2025-06-23 14:16             ` David Howells

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aBoVAd-XX_44RKbC@infradead.org \
    --to=hch@infradead.org \
    --cc=andrew@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=david@redhat.com \
    --cc=dhowells@redhat.com \
    --cc=edumazet@google.com \
    --cc=jhubbard@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=netdev@vger.kernel.org \
    --cc=willemb@google.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.