Re: Device mem changes vs pinning/zerocopy changes

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Stanislav Fomichev <stfomichev@gmail.com>
To: David Howells <dhowells@redhat.com>
Cc: Mina Almasry <almasrymina@google.com>,
	willy@infradead.org, hch@infradead.org,
	Jakub Kicinski <kuba@kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: Device mem changes vs pinning/zerocopy changes
Date: Fri, 30 May 2025 08:50:10 -0700	[thread overview]
Message-ID: <aDnTsvbyKCTkZbOR@mini-arch> (raw)
In-Reply-To: <770012.1748618092@warthog.procyon.org.uk>

On 05/30, David Howells wrote:
> Hi Mina,
> 
> I've seen your transmission-side TCP devicemem stuff has just gone in and it
> conflicts somewhat with what I'm trying to do.  I think you're working on the
> problem bottom up and I'm working on it top down, so if you're willing to
> collaborate on it...?
> 
> So, to summarise what we need to change (you may already know all of this):
> 
>  (*) The refcount in struct page is going to go away.  The sk_buff fragment
>      wrangling code, however, occasionally decides to override the zerocopy
>      mode and grab refs on the pages pointed to by those fragments.  sk_buffs
>      *really* want those page refs - and it does simplify memory handling.
>      But.
> 
>      Anyway, we need to stop taking refs where possible.  A fragment may in
>      future point to a sequence of pages and we would only be getting a ref on
>      one of them.
> 
>  (*) Further, the page struct is intended to be slimmed down to a single typed
>      pointer if possible, so all the metadata in the net_iov struct will have
>      to be separately allocated.
> 
>  (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
>      pages specified by the iterator but we need to stop doing that.  We need
>      to call GUP to take a "pin" instead (and must not take any refs).  The
>      pages we get access to may be folio-type, anon-type, some sort of device
>      type.
> 
>  (*) It would be good to do a batch lookup of user buffers to cut down on the
>      number of page table trawls we do - but, on the other hand, that might
>      generate more page faults upfront.
> 
>  (*) Splice and vmsplice.  If only I could uninvent them...  Anyway, they give
>      us buffers from a pipe - but the buffers come with destructors and should
>      not have refs taken on the pages we might think they have, but use the
>      destructor instead.
> 
>  (*) The intention is to change struct bio_vec to be just physical address and
>      length, with no page pointer.  You'd then use, say, kmap_local_phys() or
>      kmap_local_bvec() to access the contents from the cpu.  We could then
>      revert the fragment pointers to being bio_vecs.
> 
>  (*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
>      to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
>      any case, the object lifetime is not managed by refcount.  However, if we
>      had a destructor, this restriction could go away.
> 
> 
> So what I'd like to do is:

[..]

>  (1) Separate fragment lifetime management from sk_buff.  No more wangling of
>      refcounts in the skbuff code.  If you clone an skb, you stick an extra
>      ref on the lifetime management struct, not the page.

For device memory TCP we already have this: net_devmem_dmabuf_binding
is the owner of the frags. And when we reference skb frag we reference
only this owner, not individual chunks: __skb_frag_ref -> get_netmem ->
net_devmem_get_net_iov (ref on the binding).

Will it be possible to generalize this to cover MSG_ZEROCOPY and splice
cases? From what I can tell, this is somewhat equivalent of your net_txbuf.

next prev parent reply	other threads:[~2025-05-30 15:50 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-30 15:14 Device mem changes vs pinning/zerocopy changes David Howells
2025-05-30 15:50 ` Stanislav Fomichev [this message]
2025-05-30 16:22 ` Mina Almasry
2025-06-04 14:56 ` David Howells
2025-06-05 18:59   ` Mina Almasry
2025-06-04 15:34 ` David Howells
2025-06-05 19:27   ` Mina Almasry
2025-06-04 15:59 ` David Howells
2025-06-05 19:30   ` Mina Almasry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aDnTsvbyKCTkZbOR@mini-arch \
    --to=stfomichev@gmail.com \
    --cc=almasrymina@google.com \
    --cc=dhowells@redhat.com \
    --cc=edumazet@google.com \
    --cc=hch@infradead.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=netdev@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).