From: Stanislav Fomichev <stfomichev@gmail.com>
To: David Howells <dhowells@redhat.com>
Cc: Mina Almasry <almasrymina@google.com>,
willy@infradead.org, hch@infradead.org,
Jakub Kicinski <kuba@kernel.org>,
Eric Dumazet <edumazet@google.com>,
netdev@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: Device mem changes vs pinning/zerocopy changes
Date: Fri, 30 May 2025 08:50:10 -0700
Message-ID: <aDnTsvbyKCTkZbOR@mini-arch>
In-Reply-To: <770012.1748618092@warthog.procyon.org.uk>

On 05/30, David Howells wrote:
> Hi Mina,
>
> I've seen your transmission-side TCP devicemem stuff has just gone in and it
> conflicts somewhat with what I'm trying to do. I think you're working on the
> problem bottom up and I'm working on it top down, so if you're willing to
> collaborate on it...?
>
> So, to summarise what we need to change (you may already know all of this):
>
> (*) The refcount in struct page is going to go away. The sk_buff fragment
> wrangling code, however, occasionally decides to override the zerocopy
> mode and grab refs on the pages pointed to by those fragments. sk_buffs
> *really* want those page refs - and it does simplify memory handling.
> But.
>
> Anyway, we need to stop taking refs where possible. A fragment may in
> future point to a sequence of pages and we would only be getting a ref on
> one of them.
>
> (*) Further, the page struct is intended to be slimmed down to a single typed
> pointer if possible, so all the metadata in the net_iov struct will have
> to be separately allocated.
>
> (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
> pages specified by the iterator but we need to stop doing that. We need
> to call GUP to take a "pin" instead (and must not take any refs). The
> pages we get access to may be folio-type, anon-type, some sort of device
> type.
>
> (*) It would be good to do a batch lookup of user buffers to cut down on the
> number of page table trawls we do - but, on the other hand, that might
> generate more page faults upfront.
>
> (*) Splice and vmsplice. If only I could uninvent them... Anyway, they give
> us buffers from a pipe - but the buffers come with destructors and should
> not have refs taken on the pages we might think they have, but use the
> destructor instead.
>
> (*) The intention is to change struct bio_vec to be just physical address and
> length, with no page pointer. You'd then use, say, kmap_local_phys() or
> kmap_local_bvec() to access the contents from the cpu. We could then
> revert the fragment pointers to being bio_vecs.
>
> (*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
> to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
> any case, the object lifetime is not managed by refcount. However, if we
> had a destructor, this restriction could go away.
>
>
> So what I'd like to do is:
[..]
> (1) Separate fragment lifetime management from sk_buff. No more wangling of
> refcounts in the skbuff code. If you clone an skb, you stick an extra
> ref on the lifetime management struct, not the page.

For device memory TCP we already have this: net_devmem_dmabuf_binding
is the owner of the frags. When we take a reference on an skb frag, we
reference only this owner, not the individual chunks: __skb_frag_ref ->
get_netmem -> net_devmem_get_net_iov (ref on the binding).

Will it be possible to generalize this to cover the MSG_ZEROCOPY and
splice cases? From what I can tell, this is roughly equivalent to your
net_txbuf.
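
To make the ownership model above concrete, here is a rough userspace
sketch. It is not the kernel code: frag_owner, frag, frag_get, frag_put,
owner_init, owner_put and demo_release are all hypothetical names made up
for illustration, and the real net_devmem_dmabuf_binding has a different
layout. The only point it tries to show is that a frag holds no count of
its own, so taking a reference on any frag bumps the single count on the
owner, and the owner's destructor runs when that count drops to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical owner of a set of fragments (stand-in for a binding). */
struct frag_owner {
	atomic_int refs;                        /* one count shared by all chunks */
	void (*release)(struct frag_owner *o);  /* destructor, run at zero */
	int released;                           /* set by the demo destructor */
};

/* A fragment records its owner; it carries no refcount of its own. */
struct frag {
	struct frag_owner *owner;
	size_t off;
	size_t len;
};

/* The creator of the owner holds the initial reference. */
static void owner_init(struct frag_owner *o,
		       void (*release)(struct frag_owner *o))
{
	atomic_init(&o->refs, 1);
	o->release = release;
	o->released = 0;
}

/* Drop one reference; the last put invokes the destructor. */
static void owner_put(struct frag_owner *o)
{
	if (atomic_fetch_sub(&o->refs, 1) == 1)
		o->release(o);
}

/* Referencing a frag references only its owner, not the chunk. */
static void frag_get(const struct frag *f)
{
	atomic_fetch_add(&f->owner->refs, 1);
}

static void frag_put(const struct frag *f)
{
	owner_put(f->owner);
}

/* Demo destructor: just records that it ran. */
static void demo_release(struct frag_owner *o)
{
	o->released = 1;
}
```

In this sketch, cloning a buffer would call frag_get() once per frag and
freeing it would call frag_put() once per frag; nothing ever touches a
per-page count.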