* Device mem changes vs pinning/zerocopy changes
@ 2025-05-30 15:14 David Howells
  2025-05-30 15:50 ` Stanislav Fomichev
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: David Howells @ 2025-05-30 15:14 UTC (permalink / raw)
  To: Mina Almasry
  Cc: dhowells, willy, hch, Jakub Kicinski, Eric Dumazet, netdev,
	linux-mm, linux-kernel

Hi Mina,

I've seen your transmission-side TCP devicemem stuff has just gone in and it
conflicts somewhat with what I'm trying to do.  I think you're working on the
problem bottom up and I'm working on it top down, so if you're willing to
collaborate on it...?

So, to summarise what we need to change (you may already know all of this):

 (*) The refcount in struct page is going to go away.  The sk_buff fragment
     wrangling code, however, occasionally decides to override the zerocopy
     mode and grab refs on the pages pointed to by those fragments.  sk_buffs
     *really* want those page refs - and it does simplify memory handling.
     But.

     Anyway, we need to stop taking refs where possible.  A fragment may in
     future point to a sequence of pages and we would only be getting a ref on
     one of them.

 (*) Further, the page struct is intended to be slimmed down to a single typed
     pointer if possible, so all the metadata in the net_iov struct will have
     to be separately allocated.

 (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
     pages specified by the iterator but we need to stop doing that.  We need
     to call GUP to take a "pin" instead (and must not take any refs).  The
     pages we get access to may be folio-type, anon-type, some sort of device
     type.  (A minimal sketch of the ref-vs-pin difference follows after this
     list.)

 (*) It would be good to do a batch lookup of user buffers to cut down on the
     number of page table trawls we do - but, on the other hand, that might
     generate more page faults upfront.

 (*) Splice and vmsplice.  If only I could uninvent them...  Anyway, they give
     us buffers from a pipe - but the buffers come with destructors and should
     not have refs taken on the pages we might think they have, but use the
     destructor instead.

 (*) The intention is to change struct bio_vec to be just physical address and
     length, with no page pointer.  You'd then use, say, kmap_local_phys() or
     kmap_local_bvec() to access the contents from the cpu.  We could then
     revert the fragment pointers to being bio_vecs.

 (*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
     to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
     any case, the object lifetime is not managed by refcount.  However, if we
     had a destructor, this restriction could go away.
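
As an aside, here's a minimal sketch of the ref-vs-pin distinction from the
MSG_ZEROCOPY point above, using the existing pin_user_pages_fast() and
unpin_user_pages() API.  The helper names and the zero gup_flags are
illustrative assumptions only, not what the final code would look like:

	#include <linux/mm.h>

	/* Pin (FOLL_PIN) the pages backing a user buffer for zerocopy TX
	 * instead of taking page refs; returns the number of pages pinned.
	 */
	static int zc_pin_user_buffer(unsigned long uaddr, int nr_pages,
				      struct page **pages)
	{
		/* TX only reads the data, so no FOLL_WRITE */
		return pin_user_pages_fast(uaddr, nr_pages, 0, pages);
	}

	/* Drop the pins once the data has left the transmit queue. */
	static void zc_unpin_user_buffer(struct page **pages, int nr_pages)
	{
		unpin_user_pages(pages, nr_pages);
	}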


So what I'd like to do is:

 (1) Separate fragment lifetime management from sk_buff.  No more wangling of
     refcounts in the skbuff code.  If you clone an skb, you stick an extra
     ref on the lifetime management struct, not the page.

 (2) Create a chainable 'network buffer' struct, e.g.:

	enum net_txbuf_type {
		NET_TXBUF_BUFFERED,	/* Buffered copy of data */
		NET_TXBUF_ZCOPY_USER,	/* Zerocopy of user buffers */
		NET_TXBUF_ZCOPY_KERNEL,	/* Zerocopy of kernel buffers */
	};

	struct net_txbuf {
		struct net_txbuf	*next;
		struct mmpin		mm_pin;
		unsigned int		start_pos;
		unsigned int		end_pos;
		unsigned int		extracted_to;
		refcount_t		ref;
		enum net_txbuf_type	type;
		u8			nr_used;
		bool			wmem_charged;
		bool			got_copied;
		union {
			/* For NET_TXBUF_BUFFERED: */
			struct {
				void		*bufs[16];
				u8		bufs_orders[16];
				bool		last_buf_freeable;
			};
			/* For NET_TXBUF_ZCOPY_*: */
			struct {
				struct sock	*sk;
				struct sk_buff	*notify;
				msg_completion_t completion;
				void		*completion_data;
				struct bio_vec	frags[12];
			};
		};
	};

     (Note this is very much still a WiP and very much subject to change)

     So how I envision it working depends on the type of flow in the socket.
     For the transmission side of streaming sockets (e.g. TCP), the socket
     maintains a single chain of these.  Each txbuf is of a single type, but
     multiple types can be interleaved.

     For non-ZC flow, as data is imported, it's copied into pages attached to
     the current head txbuf of type BUFFERED, with more pages being attached
     as we progress.  Successive writes just keep adding to the space in the
     latest page added; each skbuff generated pins the txbuf it starts at, and
     each txbuf pins its successor.

     As skbuffs are consumed, they unpin the root txbuf.  However, this could
     leave an awful lot of memory pinned for a long time, so I would mitigate
     this in two ways: firstly, where possible, keep track of the transmitted
     byte position and progressively destruct the txbuf; secondly, if we
     completely use up a partially filled txbuf then reset the queue.

     An skbuff's frag list then has a bio_vec[] that refers to fragments of
     the buffers recorded in the txbuf chain.  An skbuff may span multiple
     txbufs and a txbuf may provision multiple skbuffs.

     For the transmission side of datagram sockets (e.g. UDP) where the
     messages may complete out of order, I think I would give each datagram
     its own series of txbufs, but link the tails together to manage the
     SO_EE_ORIGIN_ZEROCOPY notification generation if dealing with userspace.
     If dealing with the kernel, there's no need to link them together as the
     kernel can provide a destructor for each datagram.

 (3) When doing zerocopy from userspace, do calls to GUP to get batches of
     non-contiguous pages into a bio_vec array.

 (4) Because AF_UNIX and the loopback driver transfer packets from the
     transmission queue of one socket down into the reception queue of
     another, the use of txbufs would also need to extend onto the receive
     side (and so "txbufs" would be a misnomer).

     When receiving a packet, a txbuf would need to be allocated and the
     received buffers attached to it.  The pages wouldn't necessarily need
     refcounts as the txbuf holds them.  The skbuff holds a ref on the txbuf.

 (5) Cloning an skbuff would involve just taking an extra ref on the first
     txbuf.  Splitting off part of an skbuff would involve fast-forwarding the
     txbuf chain for the second part and pinning that.  (A rough sketch of the
     ref/chaining helpers this implies follows after this list.)

 (6) I have a chained-bio_vec array concept with an iov_iter type for it that
     might make it easier to string together the fragments in a reassembled
     packet and represent it as an iov_iter, thereby allowing us to use common
     iterator routines for things like ICMP and packet crypto.

 (7) We need to separate net_iov from struct page, and it might make things
     easier if we do that now, allocating net_iov from a slab.

 (8) Reference the txbuf in a splice and provide a destructor that drops that
     reference.  For small splices, I'd be very tempted to simply copy the
     data.  For splice-out of data that was spliced into an AF_UNIX socket or
     zerocopy data that passed through a loopback device, I'm also very
     tempted to make splice copy at that point.  There's a potential DoS
     attack whereby someone can endlessly splice tiny bits of a message or
     just sit on them, preventing the original provider from recovering its
     memory.

 (9) Make it easy for a network filesystem to create an entire compound
     message and present it to the socket in a single sendmsg() with a
     destructor.
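
To make the lifetime rules in (2) and (5) concrete, here's a rough sketch of
the get/put helpers I have in mind - all very much assumed shape, matching the
WiP struct above rather than anything that exists yet:

	#include <linux/refcount.h>
	#include <linux/slab.h>

	static struct net_txbuf *net_txbuf_get(struct net_txbuf *txb)
	{
		refcount_inc(&txb->ref);
		return txb;
	}

	static void net_txbuf_put(struct net_txbuf *txb)
	{
		while (txb && refcount_dec_and_test(&txb->ref)) {
			struct net_txbuf *next = txb->next;

			/* Release/unpin the buffers held by this txbuf, or run
			 * its completion/destructor, as appropriate to type.
			 */
			kfree(txb);
			txb = next;	/* drop the ref this txbuf held on it */
		}
	}

Cloning an skbuff (5) is then just net_txbuf_get() on its first txbuf, and
freeing an skbuff is net_txbuf_put() on the same.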

I've pushed my current changes (very incomplete as they are) to:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-experimental

I'm writing functions to abstract out the loading of data into the txbuf chain
and its attachment to skbuffs.  These can be found in skbuff.c as net_txbuf_*().  I've
modified the TCP sendmsg to use them.

David




* Re: Device mem changes vs pinning/zerocopy changes
  2025-05-30 15:14 Device mem changes vs pinning/zerocopy changes David Howells
@ 2025-05-30 15:50 ` Stanislav Fomichev
  2025-05-30 16:22 ` Mina Almasry
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Stanislav Fomichev @ 2025-05-30 15:50 UTC (permalink / raw)
  To: David Howells
  Cc: Mina Almasry, willy, hch, Jakub Kicinski, Eric Dumazet, netdev,
	linux-mm, linux-kernel

On 05/30, David Howells wrote:
> Hi Mina,
> 
> I've seen your transmission-side TCP devicemem stuff has just gone in and it
> conflicts somewhat with what I'm trying to do.  I think you're working on the
> problem bottom up and I'm working on it top down, so if you're willing to
> collaborate on it...?
> 
> So, to summarise what we need to change (you may already know all of this):
> 
>  (*) The refcount in struct page is going to go away.  The sk_buff fragment
>      wrangling code, however, occasionally decides to override the zerocopy
>      mode and grab refs on the pages pointed to by those fragments.  sk_buffs
>      *really* want those page refs - and it does simplify memory handling.
>      But.
> 
>      Anyway, we need to stop taking refs where possible.  A fragment may in
>      future point to a sequence of pages and we would only be getting a ref on
>      one of them.
> 
>  (*) Further, the page struct is intended to be slimmed down to a single typed
>      pointer if possible, so all the metadata in the net_iov struct will have
>      to be separately allocated.
> 
>  (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
>      pages specified by the iterator but we need to stop doing that.  We need
>      to call GUP to take a "pin" instead (and must not take any refs).  The
>      pages we get access to may be folio-type, anon-type, some sort of device
>      type.
> 
>  (*) It would be good to do a batch lookup of user buffers to cut down on the
>      number of page table trawls we do - but, on the other hand, that might
>      generate more page faults upfront.
> 
>  (*) Splice and vmsplice.  If only I could uninvent them...  Anyway, they give
>      us buffers from a pipe - but the buffers come with destructors and should
>      not have refs taken on the pages we might think they have, but use the
>      destructor instead.
> 
>  (*) The intention is to change struct bio_vec to be just physical address and
>      length, with no page pointer.  You'd then use, say, kmap_local_phys() or
>      kmap_local_bvec() to access the contents from the cpu.  We could then
>      revert the fragment pointers to being bio_vecs.
> 
>  (*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
>      to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
>      any case, the object lifetime is not managed by refcount.  However, if we
>      had a destructor, this restriction could go away.
> 
> 
> So what I'd like to do is:

[..]

>  (1) Separate fragment lifetime management from sk_buff.  No more wangling of
>      refcounts in the skbuff code.  If you clone an skb, you stick an extra
>      ref on the lifetime management struct, not the page.

For device memory TCP we already have this: net_devmem_dmabuf_binding
is the owner of the frags. And when we reference skb frag we reference
only this owner, not individual chunks: __skb_frag_ref -> get_netmem ->
net_devmem_get_net_iov (ref on the binding).

Will it be possible to generalize this to cover MSG_ZEROCOPY and splice
cases? From what I can tell, this is somewhat equivalent to your net_txbuf.



* Re: Device mem changes vs pinning/zerocopy changes
  2025-05-30 15:14 Device mem changes vs pinning/zerocopy changes David Howells
  2025-05-30 15:50 ` Stanislav Fomichev
@ 2025-05-30 16:22 ` Mina Almasry
  2025-06-04 14:56 ` David Howells
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Mina Almasry @ 2025-05-30 16:22 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, Jakub Kicinski, Eric Dumazet, netdev, linux-mm,
	linux-kernel

On Fri, May 30, 2025 at 8:15 AM David Howells <dhowells@redhat.com> wrote:
>
> Hi Mina,
>
> I've seen your transmission-side TCP devicemem stuff has just gone in and it
> conflicts somewhat with what I'm trying to do.  I think you're working on the
> problem bottom up and I'm working on it top down, so if you're willing to
> collaborate on it...?
>

Hi David! Yes, very happy to collaborate. FWIW, my initial gut feeling
is that the work doesn't conflict that much. The tcp devmem
netmem/net_iov stuff is designed to follow the page stuff, and as the
usage of struct page changes we're happy moving net_iovs and netmems
to do the same thing. My read is that it will take a small amount of
extra work, but there are no in-principle design conflicts, at least
AFAICT so far.

> So, to summarise what we need to change (you may already know all of this):
>
>  (*) The refcount in struct page is going to go away.  The sk_buff fragment
>      wrangling code, however, occasionally decides to override the zerocopy
>      mode and grab refs on the pages pointed to by those fragments.  sk_buffs
>      *really* want those page refs - and it does simplify memory handling.
>      But.
>

I believe the main challenge here is that there are many code paths in
the net stack that expect to be able to grab a ref on skb frags. See
all the callers of skb_frag_ref/skb_frag_unref:

tcp_grow_skb, __skb_zcopy_downgrade_managed, __pskb_copy_fclone,
pskb_expand_head, skb_zerocopy, skb_split, pskb_carve_inside_header,
pskb_carve_inside_nonlinear, tcp_clone_payload, skb_segment.

I think to accomplish what you're describing we need to modify
skb_frag_ref to do something other than taking a reference on the
page or net_iov. I think maybe taking a reference on the skb itself
may be acceptable, and the skb can 'guarantee' that the individual
frags underneath it don't disappear while these functions are
executing.

But devmem TCP doesn't get much in the way here, AFAICT. It's really
the fact that so many of the networking code paths want to obtain page
refs via skb_frag_ref so there is potentially a lot of code to touch.

>      Anyway, we need to stop taking refs where possible.  A fragment may in
>      future point to a sequence of pages and we would only be getting a ref on
>      one of them.
>
>  (*) Further, the page struct is intended to be slimmed down to a single typed
>      pointer if possible, so all the metadata in the net_iov struct will have
>      to be separately allocated.
>

Yes, I'm already collaborating with Byungchul on this, and we're
making great progress. I think this may be close to getting fully
reviewed:

https://lore.kernel.org/netdev/20250509115126.63190-1-byungchul@sk.com/

According to his cover letter, he's actually finding the work that I
did with netmem/net_iovs useful for him.

>  (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
>      pages specified by the iterator but we need to stop doing that.  We need
>      to call GUP to take a "pin" instead (and must not take any refs).  The
>      pages we get access to may be folio-type, anon-type, some sort of device
>      type.
>

This also doesn't conflict with the devmem changes, AFAICT. Currently in
devmem we take references on the user devmem only because the page path
also takes references on the user pages (we mirror devmem and pages to
keep the netstack uniform without excessive mem-type checks). If the code
path for zerocopy_fill_skb_from_iter doesn't need to take a ref on the
pages, we'll just migrate zerocopy_fill_skb_from_devmem to do the
same thing.

>  (*) It would be good to do a batch lookup of user buffers to cut down on the
>      number of page table trawls we do - but, on the other hand, that might
>      generate more page faults upfront.
>
>  (*) Splice and vmsplice.  If only I could uninvent them...  Anyway, they give
>      us buffers from a pipe - but the buffers come with destructors and should
>      not have refs taken on the pages we might think they have, but use the
>      destructor instead.
>

The above 2 points are orthogonal to devmem. There are no page table
walks, and no splice or vmsplice, with devmem. We just have to make sure
that the generic changes you're implementing also work with the devmem
paths. I can test your changes and point out any issues I see, but I
don't see a conflict in principle.

>  (*) The intention is to change struct bio_vec to be just physical address and
>      length, with no page pointer.  You'd then use, say, kmap_local_phys() or
>      kmap_local_bvec() to access the contents from the cpu.  We could then
>      revert the fragment pointers to being bio_vecs.
>

Ok, this part conflicts a bit, maybe. skb_frag_t no longer uses
struct bio_vec. I merged that change before the 6.12 kernel; it's not a
very recent change:

https://lore.kernel.org/netdev/20240214223405.1972973-3-almasrymina@google.com/

But, AFAICT, skb_frag_t needs a struct page inside of it, not just a
physical address. skb_frags can be mmap'd into userspace for TCP
zerocopy, see tcp_zerocopy_vm_insert_batch (which is a very old
feature, it's not a recent change). There may be other call paths in
the net stack that require a full page struct, where just a physical
address won't do (unless somehow we can mmap a physical address into
userspace).

>  (*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
>      to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
>      any case, the object lifetime is not managed by refcount.  However, if we
>      had a destructor, this restriction could go away.
>

This also sounds orthogonal to devmem.

>
> So what I'd like to do is:
>
>  (1) Separate fragment lifetime management from sk_buff.  No more wangling of
>      refcounts in the skbuff code.  If you clone an skb, you stick an extra
>      ref on the lifetime management struct, not the page.
>

Agreed, and AFAICT devmem doesn't get in the way. Let me know if you
disagree or are seeing something different.

>  (2) Create a chainable 'network buffer' struct, e.g.:
>
>         enum net_txbuf_type {
>                 NET_TXBUF_BUFFERED,     /* Buffered copy of data */
>                 NET_TXBUF_ZCOPY_USER,   /* Zerocopy of user buffers */
>                 NET_TXBUF_ZCOPY_KERNEL, /* Zerocopy of kernel buffers */
>         };
>
>         struct net_txbuf {
>                 struct net_txbuf        *next;
>                 struct mmpin            mm_pin;
>                 unsigned int            start_pos;
>                 unsigned int            end_pos;
>                 unsigned int            extracted_to;
>                 refcount_t              ref;
>                 enum net_txbuf_type     type;
>                 u8                      nr_used;
>                 bool                    wmem_charged;
>                 bool                    got_copied;
>                 union {
>                         /* For NET_TXBUF_BUFFERED: */
>                         struct {
>                                 void            *bufs[16];
>                                 u8              bufs_orders[16];
>                                 bool            last_buf_freeable;
>                         };
>                         /* For NET_TXBUF_ZCOPY_*: */
>                         struct {
>                                 struct sock     *sk;
>                                 struct sk_buff  *notify;
>                                 msg_completion_t completion;
>                                 void            *completion_data;
>                                 struct bio_vec  frags[12];
>                         };
>                 };
>         };
>
>      (Note this is very much still a WiP and very much subject to change)
>
>      So how I envision it working depends on the type of flow in the socket.
>      For the transmission side of streaming sockets (e.g. TCP), the socket
>      maintains a single chain of these.  Each txbuf is of a single type, but
>      multiple types can be interleaved.
>
>      For non-ZC flow, as data is imported, it's copied into pages attached to
>      the current head txbuf of type BUFFERED, with more pages being attached
>      as we progress.  Successive writes just keep adding to the space in the
>      latest page added and each skbuff generated pins the txbuf it starts at
>      and each txbuf pins its successor.
>
>      As skbuffs are consumed, they unpin the root txbuf.  However, this could
>      leave an awful lot of memory pinned for a long time, so I would mitigate
>      this in two ways: firstly, where possible, keep track of the transmitted
>      byte position and progressively destruct the txbuf; secondly, if we
>      completely use up a partially filled txbuf then reset the queue.
>
>      An skbuff's frag list then has a bio_vec[] that refers to fragments of
>      the buffers recorded in the txbuf chain.  An skbuff may span multiple
>      txbufs and a txbuf may provision multiple skbuffs.
>
>      For the transmission side of datagram sockets (e.g. UDP) where the
>      messages may complete out of order, I think I would give each datagram
>      its own series of txbufs, but link the tails together to manage the
>      SO_EE_ORIGIN_ZEROCOPY notification generation if dealing with userspace.
>      If dealing with the kernel, there's no need to link them together as the
>      kernel can provide a destructor for each datagram.
>

To be honest, I didn't follow the entirety of point #2 here, but the
problem is not related to devmem.

Is struct net_txbuf intended to replace struct sk_buff in the tx path
only? If so, I'm not sure that works. Currently TX and RX memory share
a single data structure (sk_buff), and I believe that is critical,
because RX skbs can be forwarded to TX and vice versa. You will need
to implement a net_txbuf_to_skb helper and vice versa if you go this
route, no? So I think, maybe, instead of introducing a new struct, you
have to make the modifications you envision to struct sk_buff itself?

>  (3) When doing zerocopy from userspace, do calls to GUP to get batches of
>      non-contiguous pages into a bio_vec array.
>
>  (4) Because AF_UNIX and the loopback driver transfer packets from the
>      transmission queue of one socket down into the reception queue of
>      another, the use of txbufs would also need to extend onto the receive
>      side (and so "txbufs" would be a misnomer).
>

OK, so you realize that TX packets can be forwarded to RX. The opposite
is also true: RX can be forwarded to TX. And it's not just AF_UNIX and
loopback. Packets can be forwarded via ip forwarding, and tc, and
probably another half dozen features in the net stack I don't know
about. I think you need to modify the existing sk_buff. I think adding
a new struct and migrating the entire net stack to use that is a bit
too ambitious. But up to you. Just my 2 cents here.

>      When receiving a packet, a txbuf would need to be allocated and the
>      received buffers attached to it.  The pages wouldn't necessarily need
>      refcounts as the txbuf holds them.  The skbuff holds a ref on the txbuf.
>
>  (5) Cloning an skbuff would involve just taking an extra ref on the first
>      txbuf.  Splitting off part of an skbuff would involve fast-forwarding the
>      txbuf chain for the second part and pinning that.
>
>  (6) I have a chained-bio_vec array concept with iov_iter type for it that
>      might make it easier to string together the fragments in a reassembled
>      packet and represent it as an iov_iter, thereby allowing us to use common
>      iterator routines for things like ICMP and packet crypto.
>
>  (7) We need to separate net_iov from struct page, and it might make things
>      easier if we do that now, allocating net_iov from a slab.
>

net_iov is already separate from struct page. The only commonality
they share is that we static_assert that the offsets of some page pool
fields are the same in struct page and struct net_iov, but that's it.
Byungchul is already handling the entire netmem_desc shrink-struct-page
work in the series I linked to earlier, and I would say his work is
going well.

>  (8) Reference the txbuf in a splice and provide a destructor that drops that
>      reference.  For small splices, I'd be very tempted to simply copy the
>      data.  For splice-out of data that was spliced into an AF_UNIX socket or
>      zerocopy data that passed through a loopback device, I'm also very
>      tempted to make splice copy at that point.  There's a potential DoS
>      attack whereby someone can endlessly splice tiny bits of a message or
>      just sit on them, preventing the original provider from recovering its
>      memory.
>
>  (9) Make it easy for a network filesystem to create an entire compound
>      message and present it to the socket in a single sendmsg() with a
>      destructor.
>
> I've pushed my current changes (very incomplete as they are) to:
>
>         https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-experimental
>
> I'm writing functions to abstract out the loading of data into the txbuf chain
> and attach to skbuff.  These can be found in skbuff.c as net_txbuf_*().  I've
> modified the TCP sendmsg to use them.
>


-- 
Thanks,
Mina



* Re: Device mem changes vs pinning/zerocopy changes
  2025-05-30 15:14 Device mem changes vs pinning/zerocopy changes David Howells
  2025-05-30 15:50 ` Stanislav Fomichev
  2025-05-30 16:22 ` Mina Almasry
@ 2025-06-04 14:56 ` David Howells
  2025-06-05 18:59   ` Mina Almasry
  2025-06-04 15:34 ` David Howells
  2025-06-04 15:59 ` David Howells
  4 siblings, 1 reply; 9+ messages in thread
From: David Howells @ 2025-06-04 14:56 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: dhowells, Mina Almasry, willy, hch, Jakub Kicinski, Eric Dumazet,
	netdev, linux-mm, linux-kernel

Stanislav Fomichev <stfomichev@gmail.com> wrote:

> >  (1) Separate fragment lifetime management from sk_buff.  No more wangling
> >      of refcounts in the skbuff code.  If you clone an skb, you stick an
> >      extra ref on the lifetime management struct, not the page.
> 
> For device memory TCP we already have this: net_devmem_dmabuf_binding
> is the owner of the frags. And when we reference skb frag we reference
> only this owner, not individual chunks: __skb_frag_ref -> get_netmem ->
> net_devmem_get_net_iov (ref on the binding).
>
> Will it be possible to generalize this to cover MSG_ZEROCOPY and splice
> cases? From what I can tell, this is somewhat equivalent of your net_txbuf.

Yes and no.  The net_devmem stuff that's now upstream still manages refs on a
per-skb-frag basis.  What I'm looking to do is to move it out of the skb and
into a separate struct so that the ref on a chunk of memory can be shared
between several skb-frags, quite possibly spread between several skbs.

This is especially important for various types of zerocopy memory where we
won't actually be allowed to take refs.

David




* Re: Device mem changes vs pinning/zerocopy changes
  2025-05-30 15:14 Device mem changes vs pinning/zerocopy changes David Howells
                   ` (2 preceding siblings ...)
  2025-06-04 14:56 ` David Howells
@ 2025-06-04 15:34 ` David Howells
  2025-06-05 19:27   ` Mina Almasry
  2025-06-04 15:59 ` David Howells
  4 siblings, 1 reply; 9+ messages in thread
From: David Howells @ 2025-06-04 15:34 UTC (permalink / raw)
  To: Mina Almasry
  Cc: dhowells, willy, hch, Jakub Kicinski, Eric Dumazet, netdev,
	linux-mm, linux-kernel

Mina Almasry <almasrymina@google.com> wrote:

> Hi David! Yes, very happy to collaborate.

:-)

> FWIW, my initial gut feeling is that the work doesn't conflict that much.
> The tcp devmem netmem/net_iov stuff is designed to follow the page stuff,
> and as the usage of struct page changes we're happy moving net_iovs and
> netmems to do the same thing. My read is that it will take a small amount of
> extra work, but there are no in-principle design conflicts, at least AFAICT
> so far.

The problem is more that the code you changed in the current merge window is
also code that I want to change, so merge conflicts will arise.

However, I'm also looking to move the points at which refs are taken/dropped,
which will directly impinge on the design of the code that's currently
upstream.

Would it help if I created some diagrams to show what I'm thinking of?

> I believe the main challenge here is that there are many code paths in
> the net stack that expect to be able to grab a ref on skb frags. See
> all the callers of skb_frag_ref/skb_frag_unref:
> 
> tcp_grow_skb, __skb_zcopy_downgrade_managed, __pskb_copy_fclone,
> pskb_expand_head, skb_zerocopy, skb_split, pskb_carve_inside_header,
> pskb_carve_inside_nonlinear, tcp_clone_payload, skb_segment.

Oh, yes, I've come to appreciate that well.  A chunk of that can actually go
away, I think.

> I think to accomplish what you're describing we need to modify
> skb_frag_ref to do something else other than taking a reference on the
> page or net_iov. I think maybe taking a reference on the skb itself
> may be acceptable, and the skb can 'guarantee' that the individual
> frags underneath it don't disappear while these functions are
> executing.

Maybe.  There is an issue with that, though it may not be insurmountable: If a
userspace process does, say, a MSG_ZEROCOPY send of a page worth of data over
TCP, under a typicalish MTU, say, 1500, this will be split across at least
three skbuffs.

This would involve making a call into GUP to get a pin - but we'd need a
separate pin for each skbuff and we might (in fact we currently do) end up
calling into GUP thrice to do the address translation and page pinning.

What I want to do is to put this outside of the skbuff so that the GUP pin can be
shared - but if, instead, we attach a pin to each skbuff, we need to get that
extra pin in some way.  Now, it may be reasonable to add a "get me an extra
pin for such-and-such a range" thing and store the {physaddr,len} in the
skbuff fragment, but we also have to be careful not to overrun the pin count -
if there's even a pin count per se.

> But devmem TCP doesn't get much in the way here, AFAICT. It's really
> the fact that so many of the networking code paths want to obtain page
> refs via skb_frag_ref so there is potentially a lot of code to touch.

Yep.

> But, AFAICT, skb_frag_t needs a struct page inside of it, not just a
> physical address. skb_frags can mmap'd into userspace for TCP
> zerocopy, see tcp_zerocopy_vm_insert_batch (which is a very old
> feature, it's not a recent change). There may be other call paths in
> the net stack that require a full page and just a physical address
> will do. (unless somehow we can mmap a physical address to the
> userspace).

Yeah - I think this needs very careful consideration and will need some
adjustment.  Some of the pages that may, in the future, get zerocopied or
spliced into the socket *really* shouldn't be spliced out into some random
process's address space - and, in fact, may not even have a refcount (say they
come from the slab).  Basically, TCP has implemented async vmsplice()...

> Is struct net_txbuf intended to replace struct sk_buff in the tx path
> only? If so, I'm not sure that works.

No - the idea is that it runs a parallel track to it and holds "references" to
the buffer memory.  This is then divided up amongst a number of sk_buffs, each
of which holds a ref on the first txbuf that it uses memory from.

txbufs would allow us to take and hold a single ref or pin (or even nothing,
just a destructor) on each piece of supplied buffer memory and for that to be
shared between a sequence of skbuffs.

> Currently TX and RX memory share a single data structure (sk_buff), and I
> believe that is critical.

Yep.  And we need to keep sk_buff because an sk_buff is more than just a
memory pinning device - it also retains the metadata for a packet.

> ... So I think, maybe, instead of introducing a new struct, you have to make
> the modifications you envision to struct sk_buff itself?

It may be possible.  But see above.  I want to be able to share pins between
sk_buffs.

> OK, you realize that TX packets can be forwarded to RX. The opposite
> is true, RX can be forwarded to TX. And it's not just AF_UNIX and
> loopback. Packets can be forwarded via ip forwarding, and tc, and
> probably another half dozen features in the net stack I don't know
> about.

Yeah, I'd noticed that.

> I think you need to modify the existing sk_buff. I think adding
> a new struct and migrating the entire net stack to use that is a bit
> too ambitious. But up to you. Just my 2 cents here.




* Re: Device mem changes vs pinning/zerocopy changes
  2025-05-30 15:14 Device mem changes vs pinning/zerocopy changes David Howells
                   ` (3 preceding siblings ...)
  2025-06-04 15:34 ` David Howells
@ 2025-06-04 15:59 ` David Howells
  2025-06-05 19:30   ` Mina Almasry
  4 siblings, 1 reply; 9+ messages in thread
From: David Howells @ 2025-06-04 15:59 UTC (permalink / raw)
  To: Mina Almasry
  Cc: dhowells, willy, hch, Jakub Kicinski, Eric Dumazet, netdev,
	linux-mm, linux-kernel

(Apologies, I accidentally sent the incomplete email)

> I think you need to modify the existing sk_buff. I think adding
> a new struct and migrating the entire net stack to use that is a bit
> too ambitious. But up to you. Just my 2 cents here.

It may come down to that, and if it does, we'll need to handle frags
differently.  Basically, for zerocopy, the following will all apply or come to
apply sometime in the future:

 (1) We're going to be getting arrays of {physaddr,len} from the higher
     layers.  I think Christoph's idea is that this makes DMA mapping easier.
     We will need to retain this.  (A small sketch of such a fragment follows
     after this list.)

 (2) There will be no page refcount.  We will obtain a pin (user zc) or there
     will be a destructor (kernel zc).  If the latter, it may have to be
     shared amongst multiple skbuffs.

 (3) We can't assume that a frag refers to a single page - or that there's
     even necessarily a page struct.  And if there is a page struct, the
     metadata isn't going to be stored in it.  The page struct will contain a
     typed pointer to some other struct (e.g. folio).

 (4) We can't assume anything about the memory type of the frag (could be
     slab, for instance).
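
As a purely illustrative sketch of (1), a fragment could end up being no more
than {physaddr,len}, with CPU access going through something like the proposed
kmap_local_phys()/kmap_local_bvec().  The sketch below cheats and goes via the
page struct for the kmap - which is exactly the dependency that needs to go
away - and all the names are assumptions, not existing API:

	#include <linux/highmem.h>
	#include <linux/mm.h>

	struct phys_vec {			/* assumed name, not an existing type */
		phys_addr_t	addr;
		unsigned int	len;
	};

	static void *frag_kmap_local(const struct phys_vec *pv)
	{
		struct page *page = pfn_to_page(PHYS_PFN(pv->addr));

		/* pair with kunmap_local() on the returned pointer */
		return kmap_local_page(page) + offset_in_page(pv->addr);
	}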

Obviously, if we copy the data into netmem buffers, we have full control of
that memory type.

David




* Re: Device mem changes vs pinning/zerocopy changes
  2025-06-04 14:56 ` David Howells
@ 2025-06-05 18:59   ` Mina Almasry
  0 siblings, 0 replies; 9+ messages in thread
From: Mina Almasry @ 2025-06-05 18:59 UTC (permalink / raw)
  To: David Howells
  Cc: Stanislav Fomichev, willy, hch, Jakub Kicinski, Eric Dumazet,
	netdev, linux-mm, linux-kernel

On Wed, Jun 4, 2025 at 7:56 AM David Howells <dhowells@redhat.com> wrote:
>
> Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> > >  (1) Separate fragment lifetime management from sk_buff.  No more wangling
> > >      of refcounts in the skbuff code.  If you clone an skb, you stick an
> > >      extra ref on the lifetime management struct, not the page.
> >
> > For device memory TCP we already have this: net_devmem_dmabuf_binding
> > is the owner of the frags. And when we reference skb frag we reference
> > only this owner, not individual chunks: __skb_frag_ref -> get_netmem ->
> > net_devmem_get_net_iov (ref on the binding).
> >
> > Will it be possible to generalize this to cover MSG_ZEROCOPY and splice
> > cases? From what I can tell, this is somewhat equivalent of your net_txbuf.
>
> Yes and no.  The net_devmem stuff that's now upstream still manages refs on a
> per-skb-frag basis.

Actually, Stan may be right: something similar to the net_devmem
model may be what you want here.

The net_devmem stuff actually never grabs references on the frags
themselves, as Stan explained (which is what you want). We have an
object 'net_devmem_dmabuf_binding', which represents a chunk of pinned
devmem passed from userspace. When the net stack asks for a ref on a
frag, we grab a ref on the binding the frag belongs too in this call
path that Stan pointed to:

__skb_frag_ref -> get_netmem -> net_devmem_get_net_iov (ref on the binding).

This sounds eerily similar to what you want to do. You could have a
new struct (net_zcopy_mem) which represents a chunk of zerocopy memory
that you've pinned using GUP or whatever the correct API is. Then
when the net stack wants a ref on a frag, you (somehow) figure out
which net_zcopy_mem it belongs to, and you grab a ref on the struct
rather than the frag.

Then when the refcount of net_zcopy_mem hits 0, you know you can
un-GUP the zcopy memory. I think that model in general may work. But
also it may be a case of everything looking like a nail to someone
with a hammer.
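
Something like this, perhaps - a minimal sketch of the net_zcopy_mem idea;
every name and field here is made up for illustration, not existing API:

	#include <linux/mm.h>
	#include <linux/refcount.h>
	#include <linux/slab.h>

	struct net_zcopy_mem {
		refcount_t	ref;		/* one ref per borrowing frag/skb */
		struct page	**pages;	/* pinned with pin_user_pages_fast() */
		unsigned int	nr_pages;
	};

	static void net_zcopy_mem_put(struct net_zcopy_mem *zmem)
	{
		if (refcount_dec_and_test(&zmem->ref)) {
			unpin_user_pages(zmem->pages, zmem->nr_pages);
			kfree(zmem->pages);
			kfree(zmem);
		}
	}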

Better yet, we already have in the code a struct that represents
zerocopy memory: struct ubuf_info_msgzc. Instead of inventing a new
struct, could you reuse this one to do the memory pinning and
refcounting on behalf of the memory underneath?

-- 
Thanks,
Mina



* Re: Device mem changes vs pinning/zerocopy changes
  2025-06-04 15:34 ` David Howells
@ 2025-06-05 19:27   ` Mina Almasry
  0 siblings, 0 replies; 9+ messages in thread
From: Mina Almasry @ 2025-06-05 19:27 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, Jakub Kicinski, Eric Dumazet, netdev, linux-mm,
	linux-kernel

On Wed, Jun 4, 2025 at 8:34 AM David Howells <dhowells@redhat.com> wrote:
> > FWIW, my initial gut feeling is that the work doesn't conflict that much.
> > The tcp devmem netmem/net_iov stuff is designed to follow the page stuff,
> > and as the usage of struct page changes we're happy moving net_iovs and
> > netmems to do the same thing. My read is that it will take a small amount of
> > extra work, but there are no in-principle design conflicts, at least AFAICT
> > so far.
>
> The problem is more the code you changed in the current merge window I'm also
> wanting to change, so merge conflicts will arise.
>
> However, I'm also looking to move the points at which refs are taken/dropped
> which will directly inpinge on the design of the code that's currently
> upstream.
>
> Would it help if I created some diagrams to show what I'm thinking of?
>

I think I understand what you want to do, but I'm happy looking at
diagrams or jumping on a call if needed.

[snip]

> > I think to accomplish what you're describing we need to modify
> > skb_frag_ref to do something else other than taking a reference on the
> > page or net_iov. I think maybe taking a reference on the skb itself
> > may be acceptable, and the skb can 'guarantee' that the individual
> > frags underneath it don't disappear while these functions are
> > executing.
>
> Maybe.  There is an issue with that, though it may not be insurmountable: If a
> userspace process does, say, a MSG_ZEROCOPY send of a page worth of data over
> TCP, under a typicalish MTU, say, 1500, this will be split across at least
> three skbuffs.
>
> This would involve making a call into GUP to get a pin - but we'd need a
> separate pin for each skbuff and we might (in fact we currently do) end up
> calling into GUP thrice to do the address translation and page pinning.
>
> What I want to do is to put this outside of the skbuff so that GUP pin can be
> shared - but if, instead, we attach a pin to each skbuff, we need to get that
> extra pin in some way.  Now, it may be reasonable to add a "get me an extra
> pin for such-and-such a range" thing and store the {physaddr,len} in the
> skbuff fragment, but we also have to be careful not to overrun the pin count -
> if there's even a pin count per se.
>

I think I understand. Currently the GUP is done in this call stack
(some helpers omitted), right?

tcp_sendmsg_locked
  skb_zerocopy_iter_stream
    zerocopy_fill_skb_from_iter
      iov_iter_get_pages2
        get_user_pages_fast

I think maybe the extra ref management you're referring to can be
tacked on to ubuf_info_msgzc? I still don't understand the need for a
completely new net_txbuf when the existing one seems to be almost what
you need, but I may be missing something.

I'm thinking, very roughly (I'm probably missing a lot of details):

1. Move the GUP call to msg_zerocopy_realloc, and save the pages array there.
2. Pass the ubuf_info_msgzc down to zerocopy_fill_skb_from_iter, and
have it fill the skb with pages from the GUP.
3. Modify skb_frag_ref such that if we want a reference on a frag that
belongs to a ubuf_info_msgzc, we grab a reference on the ubuf rather
than the frag.
4. Once the ubuf_info_msgzc refcount hits 0, you can un-GUP the memory?
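
Very roughly, for steps 1 and 4, I'm imagining something like the sketch
below (the wrapper struct and the release hook are assumptions I'm making
up for illustration, not existing API):

	#include <linux/mm.h>
	#include <linux/skbuff.h>
	#include <linux/slab.h>

	struct msgzc_pinned {
		struct ubuf_info_msgzc	uarg;		/* existing MSG_ZEROCOPY state */
		struct page		**pages;	/* saved from pin_user_pages_fast() */
		unsigned int		nr_pages;
	};

	/* Called from whatever path ends up dropping the last reference (step 4). */
	static void msgzc_pinned_release(struct ubuf_info *ubuf)
	{
		struct msgzc_pinned *p =
			container_of(ubuf, struct msgzc_pinned, uarg.ubuf);

		unpin_user_pages(p->pages, p->nr_pages);	/* un-GUP */
		kfree(p->pages);
		kfree(p);
	}

Where exactly msgzc_pinned_release() gets hooked in (the zerocopy completion
path, presumably) is the part I'm hand-waving over.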

-- 
Thanks,
Mina



* Re: Device mem changes vs pinning/zerocopy changes
  2025-06-04 15:59 ` David Howells
@ 2025-06-05 19:30   ` Mina Almasry
  0 siblings, 0 replies; 9+ messages in thread
From: Mina Almasry @ 2025-06-05 19:30 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, Jakub Kicinski, Eric Dumazet, netdev, linux-mm,
	linux-kernel

On Wed, Jun 4, 2025 at 8:59 AM David Howells <dhowells@redhat.com> wrote:
>
> (Apologies, I accidentally sent the incomplete email)
>
> > I think you need to modify the existing sk_buff. I think adding
> > a new struct and migrating the entire net stack to use that is a bit
> > too ambitious. But up to you. Just my 2 cents here.
>
> It may come down to that, and if it does, we'll need to handle frags
> differently.  Basically, for zerocopy, the following will all apply or come to
> apply sometime in the future:
>
>  (1) We're going to be getting arrays of {physaddr,len} from the higher
>      layers.  I think Christoph's idea is that this makes DMA mapping easier.
>      We will need to retain this.
>

I would punt this to a follow-up project. Currently the net stack uses
pages extensively; replacing them with scatterlist-like {physaddr,
len} sounds like a huge undertaking. Especially with the conversion to
netmem_desc happening in parallel with Byungchul's series. Just my 2
cents.

-- 
Thanks,
Mina


