* Network filesystems and netmem
@ 2025-08-08 13:16 David Howells
  2025-08-08 17:57 ` Mina Almasry
  2025-08-08 20:16 ` David Howells
  0 siblings, 2 replies; 4+ messages in thread
From: David Howells @ 2025-08-08 13:16 UTC (permalink / raw)
  To: Mina Almasry
  Cc: dhowells, willy, hch, Jakub Kicinski, Eric Dumazet,
	Byungchul Park, netfs, netdev, linux-mm, linux-kernel

Hi Mina,

Apologies for not keeping up with the stuff I proposed, but I had to go and do
a load of bugfixing.  Anyway, that gave me time to think about the netmem
allocator and how *that* may be something network filesystems can make use of.
I particularly like the way it can do DMA/IOMMU mapping in bulk (at least, if
I understand it aright).

So what I'm thinking of is changing the network filesystems - at least the
ones I can - from using kmalloc() to allocate memory for protocol fragments to
using the netmem allocator.  However, I think this might need to be
parameterisable by:

 (1) The socket.  We might want to group allocations relating to the same
     socket or destined to route through the same NIC together.

 (2) The destination address.  Again, we might need to group by NIC.  For TCP
     sockets, this likely doesn't matter as a connected TCP socket already
     knows this, but for a UDP socket, you can set that in sendmsg() (and
     indeed AF_RXRPC does just that).

 (3) The lifetime.  On a crude level, I would provide a hint flag that
     indicates whether it may be retained for some time (e.g. rxrpc DATA
     packets or TCP data) or whether the data is something we aren't going to
     retain (e.g. rxrpc ACK packets) as we might want to group these
     differently.

So what I'm thinking of is creating a net core API that looks something like:

	#define NETMEM_HINT_UNRETAINED 0x1
	void *netmem_alloc(struct socket *sock, size_t len, unsigned int hints);
	void netmem_free(void *mem);

though I'm tempted to make it:

	int netmem_alloc(struct socket *sock, size_t len, unsigned int hints,
			 struct bio_vec *bv);
	void netmem_free(struct bio_vec *bv);

to accommodate Christoph's plans for the future of bio_vec.
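
To give a feel for what the bio_vec variant would look like from the
filesystem side, here's a rough sketch (netfs_send_frag() is a made-up helper;
the error handling, the MSG_SPLICE_PAGES usage and the assumption that the
fragment is directly addressable are just illustrative):

	static int netfs_send_frag(struct socket *sock, const void *hdr,
				   size_t len)
	{
		struct bio_vec bv;
		struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };
		int ret;

		/* Ask the net core for a fragment grouped by this socket and
		 * hint that we won't be retaining it for long. */
		ret = netmem_alloc(sock, len, NETMEM_HINT_UNRETAINED, &bv);
		if (ret < 0)
			return ret;

		/* Build the protocol fragment in place, then splice it into
		 * the socket so the transport doesn't have to copy it again.
		 */
		memcpy(bvec_virt(&bv), hdr, len);
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, len);
		ret = sock_sendmsg(sock, &msg);

		netmem_free(&bv);
		return ret;
	}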

I'm going to leave the pin vs ref for direct I/O and splice issues and the
zerocopy-completion issues for later.

I'm using cifs as a testcase for this idea and now have it able to do
MSG_SPLICE_PAGES, though at the moment it's just grabbing pages and copying
data into them in the transport layer rather than using a fragment allocator
or netmem.  See:

https://lore.kernel.org/linux-fsdevel/20250806203705.2560493-4-dhowells@redhat.com/T/#t
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=cifs-experimental

David



* Re: Network filesystems and netmem
  2025-08-08 13:16 Network filesystems and netmem David Howells
@ 2025-08-08 17:57 ` Mina Almasry
  2025-08-08 20:16 ` David Howells
  1 sibling, 0 replies; 4+ messages in thread
From: Mina Almasry @ 2025-08-08 17:57 UTC (permalink / raw)
  To: David Howells, Jesper Dangaard Brouer, Ilias Apalodimas
  Cc: willy, hch, Jakub Kicinski, Eric Dumazet, Byungchul Park, netfs,
	netdev, linux-mm, linux-kernel

On Fri, Aug 8, 2025 at 6:16 AM David Howells <dhowells@redhat.com> wrote:
>
> Hi Mina,
>
> Apologies for not keeping up with the stuff I proposed, but I had to go and do
> a load of bugfixing.  Anyway, that gave me time to think about the netmem
> allocator and how *that* may be something network filesystems can make use of.
> I particularly like the way it can do DMA/IOMMU mapping in bulk (at least, if
> I understand it aright).
>

What are you referring to as the netmem allocator? Is it the page_pool
in net/core/page_pool.c? That one can indeed alloc in bulk via
alloc_pages_bulk_node, but then just loops over them to do DMA mapping
individually. It does allow you to fragment a piece of dma-mapped
memory via page_pool_fragment_netmem though. Probably that's what
you're referring to.
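
For reference, driving the existing frag interface looks roughly like this
(just a sketch with illustrative values; 'dev' is whatever device the pool
should DMA-map against and 'frag_len' is the fragment size you want, both of
which would normally come from the driver):

	struct page_pool_params pp = {
		.order		= 0,
		.pool_size	= 256,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,			/* device to DMA-map against */
		.dma_dir	= DMA_BIDIRECTIONAL,
		.flags		= PP_FLAG_DMA_MAP,	/* pool does the DMA mapping */
	};
	struct page_pool *pool = page_pool_create(&pp);
	unsigned int offset;

	/* Carve a fragment out of a pool page that's already DMA-mapped. */
	struct page *page = page_pool_dev_alloc_frag(pool, &offset, frag_len);

	/* Drop the fragment's reference; the page recycles once all of its
	 * fragments have been returned. */
	page_pool_put_full_page(pool, page, false);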

I have had an ambition to reuse the netmem_ref infra we recently
developed to upgrade the page_pool such that it actually allocates a
hugepage, maps it once and reuses shards of that chunk, but I never
got around to implementing that.
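
The rough shape of that idea would be something like this (this code doesn't
exist, it's just to show the intent; 'dev' is the device we'd map against):

	/* Map one big compound chunk up front, then hand out offsets into
	 * it, so the IOMMU work is done once per chunk rather than once per
	 * fragment. */
	struct page *chunk = alloc_pages(GFP_KERNEL | __GFP_COMP,
					 get_order(SZ_2M));
	dma_addr_t base = dma_map_page(dev, chunk, 0, SZ_2M,
				       DMA_BIDIRECTIONAL);

	/* Fragments would then be carved out of [base, base + SZ_2M) and the
	 * chunk only unmapped/freed when the last fragment is released. */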

> So what I'm thinking of is changing the network filesystems - at least the
> ones I can - from using kmalloc() to allocate memory for protocol fragments to
> using the netmem allocator.  However, I think this might need to be
> parameterisable by:
>
>  (1) The socket.  We might want to group allocations relating to the same
>      socket or destined to route through the same NIC together.
>
>  (2) The destination address.  Again, we might need to group by NIC.  For TCP
>      sockets, this likely doesn't matter as a connected TCP socket already
>      knows this, but for a UDP socket, you can set that in sendmsg() (and
>      indeed AF_RXRPC does just that).
>

The page_pool model groups memory by NIC (struct netdev), not socket
or destination address. It may be feasible to extend it to be
per-socket, but I don't immediately understand what that entails
exactly. The page_pool uses the netdev for DMA mapping; I'm not sure
what it would use the socket or destination address for (unless it's
to grab the netdev :P).

>  (3) The lifetime.  On a crude level, I would provide a hint flag that
>      indicates whether it may be retained for some time (e.g. rxrpc DATA
>      packets or TCP data) or whether the data is something we aren't going to
>      retain (e.g. rxrpc ACK packets) as we might want to group these
>      differently.
>

Today the page_pool doesn't really care how long you hold onto the mem
allocated from it. It kinda has to be that way, because the mem goes to
different sockets, and some of those sockets are used by applications
that will read the memory and free it immediately, while others may not
be read for a while (or leaked by userspace entirely - eek). AFAIU the
page_pool lets you hold onto any mem you allocate for as long as you
like.

> So what I'm thinking of is creating a net core API that looks something like:
>
>         #define NETMEM_HINT_UNRETAINED 0x1
>         void *netmem_alloc(struct socket *sock, size_t len, unsigned int hints);
>         void netmem_free(void *mem);
>
> though I'm tempted to make it:
>
>         int netmem_alloc(struct socket *sock, size_t len, unsigned int hints,
>                          struct bio_vec *bv);
>         void netmem_free(struct bio_vec *bv);
>
> to accommodate Christoph's plans for the future of bio_vec.
>

Honestly the subject of whether to extend the page_pool or implement a
new allocator kinda comes up every once in a while.

The key issue is that the page_pool has quite strict benchmarks for
how fast it does recycling, see
tools/testing/selftests/net/bench/page_pool/. Changes that don't
introduce overhead to the fast-path could be accommodated, I think. I
don't know how the maintainers are going to feel about extending its
uses even further. It took a bit of convincing to get the zerocopy
memory provider stuff in as-is :D

-- 
Thanks,
Mina


* Re: Network filesystems and netmem
  2025-08-08 13:16 Network filesystems and netmem David Howells
  2025-08-08 17:57 ` Mina Almasry
@ 2025-08-08 20:16 ` David Howells
  2025-08-08 23:28   ` Mina Almasry
  1 sibling, 1 reply; 4+ messages in thread
From: David Howells @ 2025-08-08 20:16 UTC (permalink / raw)
  To: Mina Almasry
  Cc: dhowells, Jesper Dangaard Brouer, Ilias Apalodimas, willy, hch,
	Jakub Kicinski, Eric Dumazet, Byungchul Park, netfs, netdev,
	linux-mm, linux-kernel

Mina Almasry <almasrymina@google.com> wrote:

> >  (1) The socket.  We might want to group allocations relating to the same
> >      socket or destined to route through the same NIC together.
> >
> >  (2) The destination address.  Again, we might need to group by NIC.  For TCP
> >      sockets, this likely doesn't matter as a connected TCP socket already
> >      knows this, but for a UDP socket, you can set that in sendmsg() (and
> >      indeed AF_RXRPC does just that).
> >
> 
> The page_pool model groups memory by NIC (struct netdev), not socket
> or destination address. It may be feasible to extend it to be
> per-socket, but I don't immediately understand what that entails
> exactly. The page_pool uses the netdev for DMA mapping; I'm not sure
> what it would use the socket or destination address for (unless it's
> to grab the netdev :P).

Yeah - but the network filesystem doesn't necessarily know anything about which
NIC would be used, whereas a connected TCP socket surely does.  Likewise, a UDP
socket has to perform an address lookup to find the destination/route and thus
the NIC.

So, basically, all three - the socket, the address and the flag - would be
hints, possibly unused for now.

> Today the page_pool doesn't really care how long you hold onto the mem
> allocated from it.

It's not so much whether the page pool cares how long we hold on to the mem;
it's that for a fragment allocator we want to group together things of similar
lifetime, as we don't get to reuse the page until all the things in it have
been released.

And if we're doing bulk DMA/IOMMU mapping, we also potentially have a second
constraint: an IOMMU TLB entry may be keyed for a particular device.

> Honestly the subject of whether to extend the page_pool or implement a
> new allocator kinda comes up every once in a while.

Do we actually use the netmem page pools only for receiving?  If that's the
case, then do I need to be managing this myself?  Providing my own fragment
allocator that handles bulk DMA mapping, that is.  I'd prefer to use an
existing one if I can.

David



* Re: Network filesystems and netmem
  2025-08-08 20:16 ` David Howells
@ 2025-08-08 23:28   ` Mina Almasry
  0 siblings, 0 replies; 4+ messages in thread
From: Mina Almasry @ 2025-08-08 23:28 UTC (permalink / raw)
  To: David Howells
  Cc: Jesper Dangaard Brouer, Ilias Apalodimas, willy, hch,
	Jakub Kicinski, Eric Dumazet, Byungchul Park, netfs, netdev,
	linux-mm, linux-kernel

On Fri, Aug 8, 2025 at 1:16 PM David Howells <dhowells@redhat.com> wrote:
>
> Mina Almasry <almasrymina@google.com> wrote:
>
> > >  (1) The socket.  We might want to group allocations relating to the same
> > >      socket or destined to route through the same NIC together.
> > >
> > >  (2) The destination address.  Again, we might need to group by NIC.  For TCP
> > >      sockets, this likely doesn't matter as a connected TCP socket already
> > >      knows this, but for a UDP socket, you can set that in sendmsg() (and
> > >      indeed AF_RXRPC does just that).
> > >
> >
> > The page_pool model groups memory by NIC (struct netdev), not socket
> > or destination address. It may be feasible to extend it to be
> > per-socket, but I don't immediately understand what that entails
> > exactly. The page_pool uses the netdev for DMA mapping; I'm not sure
> > what it would use the socket or destination address for (unless it's
> > to grab the netdev :P).
>
> Yeah - but the network filesystem doesn't necessarily know anything about which
> NIC would be used, whereas a connected TCP socket surely does.  Likewise, a UDP
> socket has to perform an address lookup to find the destination/route and thus
> the NIC.
>
> So, basically, all three - the socket, the address and the flag - would be
> hints, possibly unused for now.
>
> > Today the page_pool doesn't really care how long you hold onto the mem
> > allocated from it.
>
> It's not so much whether the page pool cares how long we hold on to the mem;
> it's that for a fragment allocator we want to group together things of similar
> lifetime, as we don't get to reuse the page until all the things in it have
> been released.
>
> And if we're doing bulk DMA/IOMMU mapping, we also potentially have a second
> constraint: an IOMMU TLB entry may be keyed for a particular device.
>
> > Honestly the subject of whether to extend the page_pool or implement a
> > new allocator kinda comes up every once in a while.
>
> Do we actually use the netmem page pools only for receiving?  If that's the
> case, then do I need to be managing this myself?  Providing my own fragment
> allocator that handles bulk DMA mapping, that is.  I'd prefer to use an
> existing one if I can.
>

Yes we only use page_pools for receiving at the moment. Some
discussion around using the page_pool for normal TX networking
happened in the past, but I can't find the thread.

Off the top of my head, I'm unsure what it would take to make it
compatible with some TX path. At the very least, the page_pool
currently has some dependency/logic on the napi-id it may get from the
driver, which may need to be factored out. See all the places we touch
pool->p.napi in page_pool.c and other files. Or, like you said, you may
want your own fragment allocator if wrestling the page_pool into doing
what you want is too cumbersome.

-- 
Thanks,
Mina

