* [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
@ 2025-04-23 15:35 Cosmin Ratiu
  2025-04-23 15:35 ` [PATCH net 2/2] tests/ncdevmem: Fix double-free of queue array Cosmin Ratiu
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Cosmin Ratiu @ 2025-04-23 15:35 UTC (permalink / raw)
  To: netdev, cratiu
  Cc: Jason Gunthorpe, Leon Romanovsky, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Saeed Mahameed, Tariq Toukan, Dragos Tatulea, linux-kselftest

Drivers that are told to allocate RX buffers from pools of DMA memory
should have enough memory in the pool to satisfy projected allocation
requests (a function of ring size, MTU & other parameters). If there's
not enough memory, RX ring refill might fail later at inconvenient times
(e.g. during NAPI poll).

This commit adds a check at dmabuf pool init time that compares the
amount of memory in the underlying chunk pool (configured by the user
space application providing dmabuf memory) with the desired pool size
(previously set by the driver) and fails with an error message if chunk
memory isn't enough.

Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
---
 net/core/devmem.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 6e27a47d0493..651cd55ebb28 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 int mp_dmabuf_devmem_init(struct page_pool *pool)
 {
 	struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
+	size_t size;
 
 	if (!binding)
 		return -EINVAL;
@@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
 	if (pool->p.order != 0)
 		return -E2BIG;
 
+	/* Validate that the underlying dmabuf has enough memory to satisfy
+	 * requested pool size.
+	 */
+	size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
+	if (size < pool->p.pool_size) {
+		pr_warn("%s: Insufficient dmabuf memory (%zu pages) to satisfy pool_size (%u pages)\n",
+			__func__, size, pool->p.pool_size);
+		return -ENOMEM;
+	}
+
 	net_devmem_dmabuf_binding_get(binding);
 	return 0;
 }
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH net 2/2] tests/ncdevmem: Fix double-free of queue array
  2025-04-23 15:35 [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools Cosmin Ratiu
@ 2025-04-23 15:35 ` Cosmin Ratiu
  2025-04-23 16:54   ` Stanislav Fomichev
  2025-04-23 18:49   ` Mina Almasry
  2025-04-23 16:53 ` [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools Stanislav Fomichev
  2025-04-23 17:30 ` Mina Almasry
  2 siblings, 2 replies; 15+ messages in thread
From: Cosmin Ratiu @ 2025-04-23 15:35 UTC (permalink / raw)
  To: netdev, cratiu
  Cc: Jason Gunthorpe, Leon Romanovsky, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Saeed Mahameed, Tariq Toukan, Dragos Tatulea, linux-kselftest

netdev_bind_rx takes ownership of the queue array passed as parameter
and frees it, so a queue array buffer cannot be reused across multiple
netdev_bind_rx calls.

This commit fixes that by always passing in a newly created queue array
to all netdev_bind_rx calls in ncdevmem.

Fixes: 85585b4bc8d8 ("selftests: add ncdevmem, netcat for devmem TCP")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
---
 .../selftests/drivers/net/hw/ncdevmem.c       | 55 ++++++++-----------
 1 file changed, 22 insertions(+), 33 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 2bf14ac2b8c6..9d48004ff1a1 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -431,6 +431,22 @@ static int parse_address(const char *str, int port, struct sockaddr_in6 *sin6)
 	return 0;
 }
 
+static struct netdev_queue_id *create_queues(void)
+{
+	struct netdev_queue_id *queues;
+	size_t i = 0;
+
+	queues = calloc(num_queues, sizeof(*queues));
+	for (i = 0; i < num_queues; i++) {
+		queues[i]._present.type = 1;
+		queues[i]._present.id = 1;
+		queues[i].type = NETDEV_QUEUE_TYPE_RX;
+		queues[i].id = start_queue + i;
+	}
+
+	return queues;
+}
+
 int do_server(struct memory_buffer *mem)
 {
 	char ctrl_data[sizeof(int) * 20000];
@@ -448,7 +464,6 @@ int do_server(struct memory_buffer *mem)
 	char buffer[256];
 	int socket_fd;
 	int client_fd;
-	size_t i = 0;
 	int ret;
 
 	ret = parse_address(server_ip, atoi(port), &server_sin);
@@ -471,16 +486,7 @@ int do_server(struct memory_buffer *mem)
 
 	sleep(1);
 
-	queues = malloc(sizeof(*queues) * num_queues);
-
-	for (i = 0; i < num_queues; i++) {
-		queues[i]._present.type = 1;
-		queues[i]._present.id = 1;
-		queues[i].type = NETDEV_QUEUE_TYPE_RX;
-		queues[i].id = start_queue + i;
-	}
-
-	if (bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys))
+	if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys))
 		error(1, 0, "Failed to bind\n");
 
 	tmp_mem = malloc(mem->size);
@@ -545,7 +551,6 @@ int do_server(struct memory_buffer *mem)
 			goto cleanup;
 		}
 
-		i++;
 		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
 			if (cm->cmsg_level != SOL_SOCKET ||
 			    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
@@ -630,10 +635,8 @@ int do_server(struct memory_buffer *mem)
 
 void run_devmem_tests(void)
 {
-	struct netdev_queue_id *queues;
 	struct memory_buffer *mem;
 	struct ynl_sock *ys;
-	size_t i = 0;
 
 	mem = provider->alloc(getpagesize() * NUM_PAGES);
 
@@ -641,38 +644,24 @@ void run_devmem_tests(void)
 	if (configure_rss())
 		error(1, 0, "rss error\n");
 
-	queues = calloc(num_queues, sizeof(*queues));
-
 	if (configure_headersplit(1))
 		error(1, 0, "Failed to configure header split\n");
 
-	if (!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys))
+	if (!bind_rx_queue(ifindex, mem->fd,
+			   calloc(num_queues, sizeof(struct netdev_queue_id)),
+			   num_queues, &ys))
 		error(1, 0, "Binding empty queues array should have failed\n");
 
-	for (i = 0; i < num_queues; i++) {
-		queues[i]._present.type = 1;
-		queues[i]._present.id = 1;
-		queues[i].type = NETDEV_QUEUE_TYPE_RX;
-		queues[i].id = start_queue + i;
-	}
-
 	if (configure_headersplit(0))
 		error(1, 0, "Failed to configure header split\n");
 
-	if (!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys))
+	if (!bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys))
 		error(1, 0, "Configure dmabuf with header split off should have failed\n");
 
 	if (configure_headersplit(1))
 		error(1, 0, "Failed to configure header split\n");
 
-	for (i = 0; i < num_queues; i++) {
-		queues[i]._present.type = 1;
-		queues[i]._present.id = 1;
-		queues[i].type = NETDEV_QUEUE_TYPE_RX;
-		queues[i].id = start_queue + i;
-	}
-
-	if (bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys))
+	if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys))
 		error(1, 0, "Failed to bind\n");
 
 	/* Deactivating a bound queue should not be legal */
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-23 15:35 [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools Cosmin Ratiu
  2025-04-23 15:35 ` [PATCH net 2/2] tests/ncdevmem: Fix double-free of queue array Cosmin Ratiu
@ 2025-04-23 16:53 ` Stanislav Fomichev
  2025-04-23 17:30 ` Mina Almasry
  2 siblings, 0 replies; 15+ messages in thread
From: Stanislav Fomichev @ 2025-04-23 16:53 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: netdev, Jason Gunthorpe, Leon Romanovsky, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Saeed Mahameed, Tariq Toukan, Dragos Tatulea,
	linux-kselftest

On 04/23, Cosmin Ratiu wrote:
> Drivers that are told to allocate RX buffers from pools of DMA memory
> should have enough memory in the pool to satisfy projected allocation
> requests (a function of ring size, MTU & other parameters). If there's
> not enough memory, RX ring refill might fail later at inconvenient times
> (e.g. during NAPI poll).
> 
> This commit adds a check at dmabuf pool init time that compares the
> amount of memory in the underlying chunk pool (configured by the user
> space application providing dmabuf memory) with the desired pool size
> (previously set by the driver) and fails with an error message if chunk
> memory isn't enough.
> 
> Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> ---
>  net/core/devmem.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 6e27a47d0493..651cd55ebb28 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
>  int mp_dmabuf_devmem_init(struct page_pool *pool)
>  {
>  	struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> +	size_t size;
>  
>  	if (!binding)
>  		return -EINVAL;
> @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
>  	if (pool->p.order != 0)
>  		return -E2BIG;
>  
> +	/* Validate that the underlying dmabuf has enough memory to satisfy
> +	 * requested pool size.
> +	 */

I think it's useful to have a check, but note that this check is in no
way a guarantee that the genpool has enough capacity. We can use the
same binding on multiple queues... Can you expand the comment a bit
to explain that it's more of a sanity check than a guarantee?
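
Something like this maybe (just illustrative wording, based on the above):

	/* Opportunistic sanity check: the same binding can back several
	 * queues / page pools, so this doesn't guarantee that allocations
	 * will always succeed, only that the dmabuf can cover at least
	 * this one pool.
	 */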

> +	size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> +	if (size < pool->p.pool_size) {
> +		pr_warn("%s: Insufficient dmabuf memory (%zu pages) to satisfy pool_size (%u pages)\n",

Let's print the sizes in bytes? We might have order>0 pages soon in the
pp: https://lore.kernel.org/netdev/20250421222827.283737-1-kuba@kernel.org/T/#t
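
Completely untested, but I'm thinking of something along these lines,
comparing bytes and taking pool->p.order into account:

	size_t dmabuf_size = gen_pool_size(binding->chunk_pool);
	size_t pool_bytes;

	pool_bytes = (size_t)pool->p.pool_size << (PAGE_SHIFT + pool->p.order);
	if (dmabuf_size < pool_bytes) {
		pr_warn("%s: Insufficient dmabuf memory (%zu bytes) to satisfy pool_size (%zu bytes)\n",
			__func__, dmabuf_size, pool_bytes);
		return -ENOMEM;
	}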

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2] tests/ncdevmem: Fix double-free of queue array
  2025-04-23 15:35 ` [PATCH net 2/2] tests/ncdevmem: Fix double-free of queue array Cosmin Ratiu
@ 2025-04-23 16:54   ` Stanislav Fomichev
  2025-04-23 18:49   ` Mina Almasry
  1 sibling, 0 replies; 15+ messages in thread
From: Stanislav Fomichev @ 2025-04-23 16:54 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: netdev, Jason Gunthorpe, Leon Romanovsky, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Saeed Mahameed, Tariq Toukan, Dragos Tatulea,
	linux-kselftest

On 04/23, Cosmin Ratiu wrote:
> netdev_bind_rx takes ownership of the queue array passed as parameter
> and frees it, so a queue array buffer cannot be reused across multiple
> netdev_bind_rx calls.
> 
> This commit fixes that by always passing in a newly created queue array
> to all netdev_bind_rx calls in ncdevmem.
> 
> Fixes: 85585b4bc8d8 ("selftests: add ncdevmem, netcat for devmem TCP")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-23 15:35 [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools Cosmin Ratiu
  2025-04-23 15:35 ` [PATCH net 2/2] tests/ncdevmem: Fix double-free of queue array Cosmin Ratiu
  2025-04-23 16:53 ` [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools Stanislav Fomichev
@ 2025-04-23 17:30 ` Mina Almasry
  2025-04-23 20:15   ` Stanislav Fomichev
  2025-04-24  8:47   ` Cosmin Ratiu
  2 siblings, 2 replies; 15+ messages in thread
From: Mina Almasry @ 2025-04-23 17:30 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: netdev, Jason Gunthorpe, Leon Romanovsky, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Saeed Mahameed, Tariq Toukan, Dragos Tatulea,
	linux-kselftest

On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
>
> Drivers that are told to allocate RX buffers from pools of DMA memory
> should have enough memory in the pool to satisfy projected allocation
> requests (a function of ring size, MTU & other parameters). If there's
> not enough memory, RX ring refill might fail later at inconvenient times
> (e.g. during NAPI poll).
>

My understanding is that if the RX ring refill fails, the driver will
post the buffers it was able to allocate data for, and will not post
other buffers. So it will run with a degraded performance but nothing
overly bad should happen. This should be the same behavior if the
machine is under memory pressure.

In general I don't know about this change. If the user wants to use
very small dmabufs, they should be able to, without going through
hoops reducing the number of rx ring slots the driver has (if it
supports configuring that).

I think maybe printing an error or warning that the dmabuf is too
small for the pool_size may be fine. But outright failing this
configuration? I don't think so.

> This commit adds a check at dmabuf pool init time that compares the
> amount of memory in the underlying chunk pool (configured by the user
> space application providing dmabuf memory) with the desired pool size
> (previously set by the driver) and fails with an error message if chunk
> memory isn't enough.
>
> Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> ---
>  net/core/devmem.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 6e27a47d0493..651cd55ebb28 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
>  int mp_dmabuf_devmem_init(struct page_pool *pool)
>  {
>         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> +       size_t size;
>
>         if (!binding)
>                 return -EINVAL;
> @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
>         if (pool->p.order != 0)
>                 return -E2BIG;
>
> +       /* Validate that the underlying dmabuf has enough memory to satisfy
> +        * requested pool size.
> +        */
> +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> +       if (size < pool->p.pool_size) {

pool_size seems to be the number of ptr_ring slots in the page_pool,
not some upper or lower bound on the amount of memory the page_pool
can provide. So this check seems useless? The page_pool can still not
provide this amount of memory with dmabuf (if the netmems aren't being
recycled fast enough) or with normal memory (under memory pressure).


> +               pr_warn("%s: Insufficient dmabuf memory (%zu pages) to satisfy pool_size (%u pages)\n",
> +                       __func__, size, pool->p.pool_size);
> +               return -ENOMEM;

In general I think maybe printing an error/warn to dmesg to warn the
user that this is a weird configuration may be fine, but return
-ENOMEM? I don't think this should be an unsupported configuration and
I don't think checking against p.pool_size guarantees anything.

> +       }
> +
>         net_devmem_dmabuf_binding_get(binding);
>         return 0;
>  }
> --
> 2.45.0
>
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 2/2] tests/ncdevmem: Fix double-free of queue array
  2025-04-23 15:35 ` [PATCH net 2/2] tests/ncdevmem: Fix double-free of queue array Cosmin Ratiu
  2025-04-23 16:54   ` Stanislav Fomichev
@ 2025-04-23 18:49   ` Mina Almasry
  1 sibling, 0 replies; 15+ messages in thread
From: Mina Almasry @ 2025-04-23 18:49 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: netdev, Jason Gunthorpe, Leon Romanovsky, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Saeed Mahameed, Tariq Toukan, Dragos Tatulea,
	linux-kselftest

On Wed, Apr 23, 2025 at 9:00 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
>
> netdev_bind_rx takes ownership of the queue array passed as parameter
> and frees it, so a queue array buffer cannot be reused across multiple
> netdev_bind_rx calls.
>
> This commit fixes that by always passing in a newly created queue array
> to all netdev_bind_rx calls in ncdevmem.
>
> Fixes: 85585b4bc8d8 ("selftests: add ncdevmem, netcat for devmem TCP")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>

Thank you!

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-23 17:30 ` Mina Almasry
@ 2025-04-23 20:15   ` Stanislav Fomichev
  2025-04-24 20:57     ` Mina Almasry
  2025-04-24  8:47   ` Cosmin Ratiu
  1 sibling, 1 reply; 15+ messages in thread
From: Stanislav Fomichev @ 2025-04-23 20:15 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Cosmin Ratiu, netdev, Jason Gunthorpe, Leon Romanovsky,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Saeed Mahameed, Tariq Toukan,
	Dragos Tatulea, linux-kselftest

On 04/23, Mina Almasry wrote:
> On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
> >
> > Drivers that are told to allocate RX buffers from pools of DMA memory
> > should have enough memory in the pool to satisfy projected allocation
> > requests (a function of ring size, MTU & other parameters). If there's
> > not enough memory, RX ring refill might fail later at inconvenient times
> > (e.g. during NAPI poll).
> >
> 
> My understanding is that if the RX ring refill fails, the driver will
> post the buffers it was able to allocate data for, and will not post
> other buffers. So it will run with a degraded performance but nothing
> overly bad should happen. This should be the same behavior if the
> machine is under memory pressure.
> 
> In general I don't know about this change. If the user wants to use
> very small dmabufs, they should be able to, without going through
> hoops reducing the number of rx ring slots the driver has (if it
> supports configuring that).
> 
> I think maybe printing an error or warning that the dmabuf is too
> small for the pool_size may be fine. But outright failing this
> configuration? I don't think so.
> 
> > This commit adds a check at dmabuf pool init time that compares the
> > amount of memory in the underlying chunk pool (configured by the user
> > space application providing dmabuf memory) with the desired pool size
> > (previously set by the driver) and fails with an error message if chunk
> > memory isn't enough.
> >
> > Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > ---
> >  net/core/devmem.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 6e27a47d0493..651cd55ebb28 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> >  int mp_dmabuf_devmem_init(struct page_pool *pool)
> >  {
> >         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > +       size_t size;
> >
> >         if (!binding)
> >                 return -EINVAL;
> > @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> >         if (pool->p.order != 0)
> >                 return -E2BIG;
> >
> > +       /* Validate that the underlying dmabuf has enough memory to satisfy
> > +        * requested pool size.
> > +        */
> > +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> > +       if (size < pool->p.pool_size) {
> 
> pool_size seems to be the number of ptr_ring slots in the page_pool,
> not some upper or lower bound on the amount of memory the page_pool
> can provide. So this check seems useless? The page_pool can still not
> provide this amount of memory with dmabuf (if the netmems aren't being
> recycled fast enough) or with normal memory (under memory pressure).

I read this check more as "is there enough chunks in the binding to
fully fill in the page pool". User controls the size of rx ring which
controls the size of the page pool which somewhat dictates the minimal
size of the binding (maybe). So it's more of a sanity check.

Maybe having better defaults in ncdevmem would've been a better option? It
allocates (16000*4096) bytes (slightly less than 64MB, why? to fit into
default /sys/module/udmabuf/parameters/size_limit_mb?) and on my setup
PP wants to get 64MB at least..

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-23 17:30 ` Mina Almasry
  2025-04-23 20:15   ` Stanislav Fomichev
@ 2025-04-24  8:47   ` Cosmin Ratiu
  2025-04-24 20:50     ` Mina Almasry
  1 sibling, 1 reply; 15+ messages in thread
From: Cosmin Ratiu @ 2025-04-24  8:47 UTC (permalink / raw)
  To: almasrymina@google.com
  Cc: Jason Gunthorpe, andrew+netdev@lunn.ch, davem@davemloft.net,
	Tariq Toukan, linux-kselftest@vger.kernel.org, Dragos Tatulea,
	Saeed Mahameed, pabeni@redhat.com, Leon Romanovsky,
	edumazet@google.com, netdev@vger.kernel.org, horms@kernel.org,
	kuba@kernel.org

On Wed, 2025-04-23 at 10:30 -0700, Mina Almasry wrote:
> On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com>
> wrote:
> > 
> > Drivers that are told to allocate RX buffers from pools of DMA
> > memory
> > should have enough memory in the pool to satisfy projected
> > allocation
> > requests (a function of ring size, MTU & other parameters). If
> > there's
> > not enough memory, RX ring refill might fail later at inconvenient
> > times
> > (e.g. during NAPI poll).
> > 
> 
> My understanding is that if the RX ring refill fails, the driver will
> post the buffers it was able to allocate data for, and will not post
> other buffers. So it will run with a degraded performance but nothing
> overly bad should happen. This should be the same behavior if the
> machine is under memory pressure.

What motivated this change was a failure in how mlx5 refills rings
today. For efficiency, ring refill triggered by NAPI polling only
releases old buffers just before allocating new ones, so it effectively
has a built-in assumption that the ring can be filled. Commit 4c2a13236807
("net/mlx5e: RX, Defer page release in striding rq for better
recycling") is an interesting read here.

For more details, see the do{ }while loop in mlx5e_post_rx_mpwqes,
where mlx5e_free_rx_mpwqe then mlx5e_alloc_rx_mpwqe are called to free
the old buffer and reallocate a new one at the same position. This has
excellent cache-locality and the pages returned to the pool will be
reused by the new descriptor.

The bug in mlx5 is that with a large MTU & ring size, the ring cannot
be fully populated with rx descriptors because the pool doesn't have
enough memory, but there's no memory released back to the pool for new
ones. Eventually, rx descriptors are exhausted and traffic stops.

> In general I don't know about this change. If the user wants to use
> very small dmabufs, they should be able to, without going through
> hoops reducing the number of rx ring slots the driver has (if it
> supports configuring that).
> 
> I think maybe printing an error or warning that the dmabuf is too
> small for the pool_size may be fine. But outright failing this
> configuration? I don't think so.

For regular memory-backed page pools, there's no hard limit of how big
they can become (except available physical memory), so this issue was
not seen before.

I didn't look at other drivers yet, but is it expected that drivers
operate with incompletely filled rings? I assumed that since the user
configured a specific ring size, they expect drivers to be able to use
that size and not silently operate in degraded mode with a smaller
ring size.

If you think drivers should work in degraded mode, we can look at
improving the ring population code to work better in this scenario.

> pool_size seems to be the number of ptr_ring slots in the page_pool,
> not some upper or lower bound on the amount of memory the page_pool
> can provide. So this check seems useless? The page_pool can still not
> provide this amount of memory with dmabuf (if the netmems aren't
> being
> recycled fast enough) or with normal memory (under memory pressure).

I think pool_size is usually set by drivers based on their params, and
it's the max size of pool->ring. The opportunistic check I added
compares this demand with the supply (available chunk memory) and fails
this config based on the assumption that there should be enough memory
in the pool to satisfy driver needs.

Please let me know your thoughts and how to proceed.

Cosmin.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-24  8:47   ` Cosmin Ratiu
@ 2025-04-24 20:50     ` Mina Almasry
  0 siblings, 0 replies; 15+ messages in thread
From: Mina Almasry @ 2025-04-24 20:50 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: Jason Gunthorpe, andrew+netdev@lunn.ch, davem@davemloft.net,
	Tariq Toukan, linux-kselftest@vger.kernel.org, Dragos Tatulea,
	Saeed Mahameed, pabeni@redhat.com, Leon Romanovsky,
	edumazet@google.com, netdev@vger.kernel.org, horms@kernel.org,
	kuba@kernel.org

On Thu, Apr 24, 2025 at 1:47 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
>
> On Wed, 2025-04-23 at 10:30 -0700, Mina Almasry wrote:
> > On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com>
> > wrote:
> > >
> > > Drivers that are told to allocate RX buffers from pools of DMA
> > > memory
> > > should have enough memory in the pool to satisfy projected
> > > allocation
> > > requests (a function of ring size, MTU & other parameters). If
> > > there's
> > > not enough memory, RX ring refill might fail later at inconvenient
> > > times
> > > (e.g. during NAPI poll).
> > >
> >
> > My understanding is that if the RX ring refill fails, the driver will
> > post the buffers it was able to allocate data for, and will not post
> > other buffers. So it will run with a degraded performance but nothing
> > overly bad should happen. This should be the same behavior if the
> > machine is under memory pressure.
>
> What motivated this change was a failure in how mlx5 refills rings
> today. For efficiency, ring refill triggered by NAPI polling only
> releases old buffers just before allocating new ones so effectively has
> a built-in assumption that the ring can be filled. Commit 4c2a13236807
> ("net/mlx5e: RX, Defer page release in striding rq for better
> recycling") is an interesting read here.
>
> For more details, see the do{ }while loop in mlx5e_post_rx_mpwqes,
> where mlx5e_free_rx_mpwqe then mlx5e_alloc_rx_mpwqe are called to free
> the old buffer and reallocate a new one at the same position. This has
> excellent cache-locality and the pages returned to the pool will be
> reused by the new descriptor.
>

Thanks for the detailed explanation. This seems like a clever optimization.

> The bug in mlx5 is that with a large MTU & ring size, the ring cannot
> be fully populated with rx descriptors because the pool doesn't have
> enough memory, but there's no memory released back to the pool for new
> ones. Eventually, rx descriptors are exhausted and traffic stops.
>
> > In general I don't know about this change. If the user wants to use
> > very small dmabufs, they should be able to, without going through
> > hoops reducing the number of rx ring slots the driver has (if it
> > supports configuring that).
> >
> > I think maybe printing an error or warning that the dmabuf is too
> > small for the pool_size may be fine. But outright failing this
> > configuration? I don't think so.
>
> For regular memory-backed page pools, there's no hard limit of how big
> they can become (except available physical memory), so this issue was
> not seen before.
>
> I didn't look at other drivers yet, but is it expected that drivers
> operate with incompletely filled rings? I assumed that since the user
> configured a specified ring size, it expects drivers to be able to use
> that size and not silently operate in degraded mode, with a smaller
> ring size.
>
> If you think drivers should work in degraded mode, we can look at
> improving the ring population code to work better in this scenario.
>

You're probably a bigger expert than me on what drivers should do in
general, but yes, this seems like an mlx5 limitation, not a general
limitation of all drivers. GVE for example, I think, has a host of
optimizations for memory pressure scenarios that make it resilient to
this. I ran this test:

mina-3 /home/almasrymina_google_com # ethtool -g eth0
Ring parameters for eth0:
...
Current hardware settings:
...
RX:             1024

Then hacked ncdevmem to only provide a 64 page udmabuf:

diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 1f9fb0b1cb811..6de64f7680241 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -76,7 +76,7 @@

 #define PAGE_SHIFT 12
 #define TEST_PREFIX "ncdevmem"
-#define NUM_PAGES 16000
+#define NUM_PAGES 64

 #ifndef MSG_SOCK_DEVMEM
 #define MSG_SOCK_DEVMEM 0x2000000

To my surprise, the test passed just fine. Seems the limitation at
least doesn't apply to GVE. I don't know what the rest of the drivers
do, but so far this looks like driver-specific behavior. Putting
limitations in the core stack for mlx5-specific issues doesn't seem
great.

> > pool_size seems to be the number of ptr_ring slots in the page_pool,
> > not some upper or lower bound on the amount of memory the page_pool
> > can provide. So this check seems useless? The page_pool can still not
> > provide this amount of memory with dmabuf (if the netmems aren't
> > being
> > recycled fast enough) or with normal memory (under memory pressure).
>
> I think pool_size is usually set by drivers based on their params, and
> it's the max size of pool->ring. The opportunistic check I added
> compares this demand with the supply (available chunk memory) and fails
> this config based on the assumption that there should be enough memory
> in the pool to satisfy driver needs.
>
> Please let me know your thoughts and how to proceed.

I think there are better options here:

1. In the page_pool, warn if the dmabuf is too small for the ring
size, but don't outright prevent the configuration. If the user is
running on a driver that doesn't have a dmabuf size limitation, let
them ignore the warning and run it.

2. In mlx5 code, somehow find out how big the dmabuf is (it may need
a new pp api, roughly sketched below), and prevent this configuration
in the driver to work around your (I think) driver-specific limitation.

3. Maybe address the general limitation in mlx5, and make it work even
if it can't refill its rings? It would help this case and other memory
pressure scenarios as well.
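
For option 2, the page_pool side could be something as simple as this
(completely hypothetical helper, nothing like it exists today):

	/* Hypothetical, for illustration only: expose the size of the
	 * dmabuf backing a devmem page_pool so a driver can sanity-check
	 * it against its ring size.
	 */
	size_t page_pool_mp_dmabuf_size(const struct page_pool *pool)
	{
		struct net_devmem_dmabuf_binding *binding = pool->mp_priv;

		return binding ? gen_pool_size(binding->chunk_pool) : 0;
	}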





-- 
Thanks,
Mina

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-23 20:15   ` Stanislav Fomichev
@ 2025-04-24 20:57     ` Mina Almasry
  2025-04-24 22:10       ` Stanislav Fomichev
  0 siblings, 1 reply; 15+ messages in thread
From: Mina Almasry @ 2025-04-24 20:57 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Cosmin Ratiu, netdev, Jason Gunthorpe, Leon Romanovsky,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Saeed Mahameed, Tariq Toukan,
	Dragos Tatulea, linux-kselftest

On Wed, Apr 23, 2025 at 1:15 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 04/23, Mina Almasry wrote:
> > On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
> > >
> > > Drivers that are told to allocate RX buffers from pools of DMA memory
> > > should have enough memory in the pool to satisfy projected allocation
> > > requests (a function of ring size, MTU & other parameters). If there's
> > > not enough memory, RX ring refill might fail later at inconvenient times
> > > (e.g. during NAPI poll).
> > >
> >
> > My understanding is that if the RX ring refill fails, the driver will
> > post the buffers it was able to allocate data for, and will not post
> > other buffers. So it will run with a degraded performance but nothing
> > overly bad should happen. This should be the same behavior if the
> > machine is under memory pressure.
> >
> > In general I don't know about this change. If the user wants to use
> > very small dmabufs, they should be able to, without going through
> > hoops reducing the number of rx ring slots the driver has (if it
> > supports configuring that).
> >
> > I think maybe printing an error or warning that the dmabuf is too
> > small for the pool_size may be fine. But outright failing this
> > configuration? I don't think so.
> >
> > > This commit adds a check at dmabuf pool init time that compares the
> > > amount of memory in the underlying chunk pool (configured by the user
> > > space application providing dmabuf memory) with the desired pool size
> > > (previously set by the driver) and fails with an error message if chunk
> > > memory isn't enough.
> > >
> > > Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> > > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > > ---
> > >  net/core/devmem.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > index 6e27a47d0493..651cd55ebb28 100644
> > > --- a/net/core/devmem.c
> > > +++ b/net/core/devmem.c
> > > @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > >  int mp_dmabuf_devmem_init(struct page_pool *pool)
> > >  {
> > >         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > > +       size_t size;
> > >
> > >         if (!binding)
> > >                 return -EINVAL;
> > > @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> > >         if (pool->p.order != 0)
> > >                 return -E2BIG;
> > >
> > > +       /* Validate that the underlying dmabuf has enough memory to satisfy
> > > +        * requested pool size.
> > > +        */
> > > +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> > > +       if (size < pool->p.pool_size) {
> >
> > pool_size seems to be the number of ptr_ring slots in the page_pool,
> > not some upper or lower bound on the amount of memory the page_pool
> > can provide. So this check seems useless? The page_pool can still not
> > provide this amount of memory with dmabuf (if the netmems aren't being
> > recycled fast enough) or with normal memory (under memory pressure).
>
> I read this check more as "is there enough chunks in the binding to
> fully fill in the page pool". User controls the size of rx ring

Only on drivers that support ethtool -G, and where it will let you
configure -G to what you want.

> which
> controls the size of the page pool which somewhat dictates the minimal
> size of the binding (maybe).

See the test I ran in the other thread. Seems at least GVE is fine
with dmabuf size < ring size. I don't know what other drivers do, but
generally speaking I think specific driver limitations should not
limit what others can do with their drivers. Sure, for the GPU mem
applications you're probably looking at, the dmabufs are huge and
supporting small dmabufs is not a concern, but someone somewhere may
want to run with a 1 MB dmabuf for some use case, and if their driver
is fine with it, core should not prevent it, I think.

> So it's more of a sanity check.
>
> Maybe having better defaults in ncdevmem would've been a better option? It
> allocates (16000*4096) bytes (slightly less than 64MB, why? to fit into
> default /sys/module/udmabuf/parameters/size_limit_mb?) and on my setup
> PP wants to get 64MB at least..

Yeah, udmabuf has a limitation that it only supports 64MB max size
last I looked.

I added devmem TCP support with udmabuf selftests to demonstrate that
the feature is non-proprietary, not to advertise that devmem tcp +
udmabuf is a great combination. udmabuf is actually terrible for
devmem TCP. The 64MB limit is way too small for anyone to do anything
performant on it and by dmaing into host memory you lose many of the
benefits of devmem TCP (lower mem bw + pcie bw utilization).

If you're running real experiments with devmem TCP I suggest moving to
real dmabufs as soon as possible, or at least hack udmabuf to give you
large sizes. We've open sourced our production devmem TCP userspace:

https://github.com/google/tcpgpudmarxd
https://github.com/google/nccl-plugin-gpudirecttcpx

Porting it to upstream APIs + your dmabuf provider will have you run
much more interesting tests than anything you do with udmabuf I think,
unless you hack the udmabuf size.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-24 20:57     ` Mina Almasry
@ 2025-04-24 22:10       ` Stanislav Fomichev
  2025-04-24 22:26         ` Mina Almasry
  0 siblings, 1 reply; 15+ messages in thread
From: Stanislav Fomichev @ 2025-04-24 22:10 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Cosmin Ratiu, netdev, Jason Gunthorpe, Leon Romanovsky,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Saeed Mahameed, Tariq Toukan,
	Dragos Tatulea, linux-kselftest

On 04/24, Mina Almasry wrote:
> On Wed, Apr 23, 2025 at 1:15 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 04/23, Mina Almasry wrote:
> > > On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
> > > >
> > > > Drivers that are told to allocate RX buffers from pools of DMA memory
> > > > should have enough memory in the pool to satisfy projected allocation
> > > > requests (a function of ring size, MTU & other parameters). If there's
> > > > not enough memory, RX ring refill might fail later at inconvenient times
> > > > (e.g. during NAPI poll).
> > > >
> > >
> > > My understanding is that if the RX ring refill fails, the driver will
> > > post the buffers it was able to allocate data for, and will not post
> > > other buffers. So it will run with a degraded performance but nothing
> > > overly bad should happen. This should be the same behavior if the
> > > machine is under memory pressure.
> > >
> > > In general I don't know about this change. If the user wants to use
> > > very small dmabufs, they should be able to, without going through
> > > hoops reducing the number of rx ring slots the driver has (if it
> > > supports configuring that).
> > >
> > > I think maybe printing an error or warning that the dmabuf is too
> > > small for the pool_size may be fine. But outright failing this
> > > configuration? I don't think so.
> > >
> > > > This commit adds a check at dmabuf pool init time that compares the
> > > > amount of memory in the underlying chunk pool (configured by the user
> > > > space application providing dmabuf memory) with the desired pool size
> > > > (previously set by the driver) and fails with an error message if chunk
> > > > memory isn't enough.
> > > >
> > > > Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> > > > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > > > ---
> > > >  net/core/devmem.c | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > index 6e27a47d0493..651cd55ebb28 100644
> > > > --- a/net/core/devmem.c
> > > > +++ b/net/core/devmem.c
> > > > @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > > >  int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > >  {
> > > >         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > > > +       size_t size;
> > > >
> > > >         if (!binding)
> > > >                 return -EINVAL;
> > > > @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > >         if (pool->p.order != 0)
> > > >                 return -E2BIG;
> > > >
> > > > +       /* Validate that the underlying dmabuf has enough memory to satisfy
> > > > +        * requested pool size.
> > > > +        */
> > > > +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> > > > +       if (size < pool->p.pool_size) {
> > >
> > > pool_size seems to be the number of ptr_ring slots in the page_pool,
> > > not some upper or lower bound on the amount of memory the page_pool
> > > can provide. So this check seems useless? The page_pool can still not
> > > provide this amount of memory with dmabuf (if the netmems aren't being
> > > recycled fast enough) or with normal memory (under memory pressure).
> >
> > I read this check more as "is there enough chunks in the binding to
> > fully fill in the page pool". User controls the size of rx ring
> 
> Only on drivers that support ethtool -G, and where it will let you
> configure -G to what you want.

gve is the minority here, any major nic (brcm/mlx/intel) supports resizing
the rings.

> > which
> > controls the size of the page pool which somewhat dictates the minimal
> > size of the binding (maybe).
> 
> See the test I ran in the other thread. Seems at least GVE is fine
> with dmabuf size < ring size. I don't know what other drivers do, but
> generally speaking I think specific driver limitations should not
> limit what others can do with their drivers. Sure for the GPU mem
> applications you're probably looking at the dmabufs are huge and
> supporting small dmabufs is not a concern, but someone somewhere may
> want to run with 1 MB dmabuf for some use case and if their driver is
> fine with it, core should not prevent it, I think.
> 
> > So it's more of a sanity check.
> >
> > Maybe having better defaults in ncdevmem would've been a better option? It
> > allocates (16000*4096) bytes (slightly less than 64MB, why? to fit into
> > default /sys/module/udmabuf/parameters/size_limit_mb?) and on my setup
> > PP wants to get 64MB at least..
> 
> Yeah, udmabuf has a limitation that it only supports 64MB max size
> last I looked.

We can use /sys/module/udmabuf/parameters/size_limit_mb to allocate
more than 64MB, ncdevmem can change it. Or warn the user similar
to what kperf does: https://github.com/facebookexperimental/kperf/blob/main/devmem.c#L308

So either having a kernel warn or tuning 63MB up to something sensible
(1G?) should prevent people from going through the pain..
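
For the ncdevmem side, maybe something like this before creating the
udmabuf (sketch only; the helper name and error handling are made up):

	static void udmabuf_raise_size_limit(unsigned long want_mb)
	{
		FILE *f = fopen("/sys/module/udmabuf/parameters/size_limit_mb", "w");

		if (!f)
			return; /* keep the 64MB default if the param isn't there */
		fprintf(f, "%lu\n", want_mb);
		fclose(f);
	}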

> I added devmem TCP support with udmabuf selftests to demonstrate that
> the feature is non-proprietary, not to advertise that devmem tcp +
> udmabuf is a great combination. udmabuf is actually terrible for
> devmem TCP. The 64MB limit is way too small for anyone to do anything
> performant on it and by dmaing into host memory you lose many of the
> benefits of devmem TCP (lower mem bw + pcie bw utilization).

It would still be nice to have a udmabuf as a properly supported option.
This can drive the UAPI performance conversions: for example, comparing
existing tcp rx zerocopy vs MSG_SOCK_DEVMEM.. So let's not completely
dismiss it. We've played internally with doing 2MB udmabuf huge-pages,
might post it at some point..

> If you're running real experiments with devmem TCP I suggest moving to
> real dmabufs as soon as possible, or at least hack udmabuf to give you
> large sizes. We've open sourced our production devmem TCP userspace:
> 
> https://github.com/google/tcpgpudmarxd
> https://github.com/google/nccl-plugin-gpudirecttcpx
> 
> Porting it to upstream APIs + your dmabuf provider will have you run
> much more interesting tests than anything you do with udmabuf I think,
> unless you hack the udmabuf size.

I found these a bit too late, so I reimplemented the plugin over
upstream APIs :-[ Plus, you yourself have acked [0], guess why
I sent this patch :-D Once the tx part is accepted, we'll upstream
kperf cuda support as well..

0: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=8b9049af8066b4705d83bb7847ee3c960fc58d09

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-24 22:10       ` Stanislav Fomichev
@ 2025-04-24 22:26         ` Mina Almasry
  2025-04-24 22:40           ` Stanislav Fomichev
  0 siblings, 1 reply; 15+ messages in thread
From: Mina Almasry @ 2025-04-24 22:26 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Cosmin Ratiu, netdev, Jason Gunthorpe, Leon Romanovsky,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Saeed Mahameed, Tariq Toukan,
	Dragos Tatulea, linux-kselftest

On Thu, Apr 24, 2025 at 3:10 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 04/24, Mina Almasry wrote:
> > On Wed, Apr 23, 2025 at 1:15 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 04/23, Mina Almasry wrote:
> > > > On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
> > > > >
> > > > > Drivers that are told to allocate RX buffers from pools of DMA memory
> > > > > should have enough memory in the pool to satisfy projected allocation
> > > > > requests (a function of ring size, MTU & other parameters). If there's
> > > > > not enough memory, RX ring refill might fail later at inconvenient times
> > > > > (e.g. during NAPI poll).
> > > > >
> > > >
> > > > My understanding is that if the RX ring refill fails, the driver will
> > > > post the buffers it was able to allocate data for, and will not post
> > > > other buffers. So it will run with a degraded performance but nothing
> > > > overly bad should happen. This should be the same behavior if the
> > > > machine is under memory pressure.
> > > >
> > > > In general I don't know about this change. If the user wants to use
> > > > very small dmabufs, they should be able to, without going through
> > > > hoops reducing the number of rx ring slots the driver has (if it
> > > > supports configuring that).
> > > >
> > > > I think maybe printing an error or warning that the dmabuf is too
> > > > small for the pool_size may be fine. But outright failing this
> > > > configuration? I don't think so.
> > > >
> > > > > This commit adds a check at dmabuf pool init time that compares the
> > > > > amount of memory in the underlying chunk pool (configured by the user
> > > > > space application providing dmabuf memory) with the desired pool size
> > > > > (previously set by the driver) and fails with an error message if chunk
> > > > > memory isn't enough.
> > > > >
> > > > > Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> > > > > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > > > > ---
> > > > >  net/core/devmem.c | 11 +++++++++++
> > > > >  1 file changed, 11 insertions(+)
> > > > >
> > > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > > index 6e27a47d0493..651cd55ebb28 100644
> > > > > --- a/net/core/devmem.c
> > > > > +++ b/net/core/devmem.c
> > > > > @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > > > >  int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > >  {
> > > > >         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > > > > +       size_t size;
> > > > >
> > > > >         if (!binding)
> > > > >                 return -EINVAL;
> > > > > @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > >         if (pool->p.order != 0)
> > > > >                 return -E2BIG;
> > > > >
> > > > > +       /* Validate that the underlying dmabuf has enough memory to satisfy
> > > > > +        * requested pool size.
> > > > > +        */
> > > > > +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> > > > > +       if (size < pool->p.pool_size) {
> > > >
> > > > pool_size seems to be the number of ptr_ring slots in the page_pool,
> > > > not some upper or lower bound on the amount of memory the page_pool
> > > > can provide. So this check seems useless? The page_pool can still not
> > > > provide this amount of memory with dmabuf (if the netmems aren't being
> > > > recycled fast enough) or with normal memory (under memory pressure).
> > >
> > > I read this check more as "is there enough chunks in the binding to
> > > fully fill in the page pool". User controls the size of rx ring
> >
> > Only on drivers that support ethtool -G, and where it will let you
> > configure -G to what you want.
>
> gve is the minority here, any major nic (brcm/mlx/intel) supports resizing
> the rings.
>

GVE supports resizing rings; other drivers may not. And even on
drivers that support resizing rings, some users may have a use case
for a dmabuf smaller than the minimum ring size their driver accepts.

> > > which
> > > controls the size of the page pool which somewhat dictates the minimal
> > > size of the binding (maybe).
> >
> > See the test I ran in the other thread. Seems at least GVE is fine
> > with dmabuf size < ring size. I don't know what other drivers do, but
> > generally speaking I think specific driver limitations should not
> > limit what others can do with their drivers. Sure for the GPU mem
> > applications you're probably looking at the dmabufs are huge and
> > supporting small dmabufs is not a concern, but someone somewhere may
> > want to run with 1 MB dmabuf for some use case and if their driver is
> > fine with it, core should not prevent it, I think.
> >
> > > So it's more of a sanity check.
> > >
> > > Maybe having better defaults in ncdevmem would've been a better option? It
> > > allocates (16000*4096) bytes (slightly less than 64MB, why? to fit into
> > > default /sys/module/udmabuf/parameters/size_limit_mb?) and on my setup
> > > PP wants to get 64MB at least..
> >
> > Yeah, udmabuf has a limitation that it only supports 64MB max size
> > last I looked.
>
> We can use /sys/module/udmabuf/parameters/size_limit_mb to allocate
> more than 64MB, ncdevmem can change it.

The udmabuf limit is hardcoded in udmabuf.c or configured on module
load, and ncdevmem doesn't load udmabuf. I guess ncdevmem could be
changed to do that, but currently it works with CONFIG_UDMABUF=y

> Or warn the user similar
> to what kperf does: https://github.com/facebookexperimental/kperf/blob/main/devmem.c#L308
>
> So either having a kernel warn or tuning 63MB up to something sensible
> (1G?) should prevent people from going through the pain..
>

Agreed with both. Another option is updating the devmem.rst docs:

"Some drivers may struggle to run devmem TCP when the dmabuf size is
too small, especially when it's smaller than the number of rx ring
slots. Look for this warning in dmesg." etc.

But I don't see the need to outright disable this "dmabuf size < ring
size" use case for everyone to solve this.

> > I added devmem TCP support with udmabuf selftests to demonstrate that
> > the feature is non-proprietary, not to advertise that devmem tcp +
> > udmabuf is a great combination. udmabuf is actually terrible for
> > devmem TCP. The 64MB limit is way too small for anyone to do anything
> > performant on it and by dmaing into host memory you lose many of the
> > benefits of devmem TCP (lower mem bw + pcie bw utilization).
>
> It would still be nice to have a udmabuf as a properly supported option.
> This can drive the UAPI performance conversions: for example, comparing
> existing tcp rx zerocopy vs MSG_SOCK_DEVMEM.. So let's not completely
> dismiss it. We've played internally with doing 2MB udmabuf huge-pages,
> might post it at some point..
>
> > If you're running real experiments with devmem TCP I suggest moving to
> > real dmabufs as soon as possible, or at least hack udmabuf to give you
> > large sizes. We've open sourced our production devmem TCP userspace:
> >
> > https://github.com/google/tcpgpudmarxd
> > https://github.com/google/nccl-plugin-gpudirecttcpx
> >
> > Porting it to upstream APIs + your dmabuf provider will have you run
> > much more interesting tests than anything you do with udmabuf I think,
> > unless you hack the udmabuf size.
>
> I found these a bit too late, so I reimplemented the plugin over
> upstream APIs :-[

Oh, where? Is it open source?

> Plus, you yourself have acked [0], guess why
> I sent this patch :-D Once the tx part is accepted, we'll upstream
> kperf cuda support as well..
>

Cool!

> 0: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=8b9049af8066b4705d83bb7847ee3c960fc58d09

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-24 22:26         ` Mina Almasry
@ 2025-04-24 22:40           ` Stanislav Fomichev
  2025-04-24 23:42             ` Mina Almasry
  0 siblings, 1 reply; 15+ messages in thread
From: Stanislav Fomichev @ 2025-04-24 22:40 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Cosmin Ratiu, netdev, Jason Gunthorpe, Leon Romanovsky,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Saeed Mahameed, Tariq Toukan,
	Dragos Tatulea, linux-kselftest

On 04/24, Mina Almasry wrote:
> On Thu, Apr 24, 2025 at 3:10 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 04/24, Mina Almasry wrote:
> > > On Wed, Apr 23, 2025 at 1:15 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > > >
> > > > On 04/23, Mina Almasry wrote:
> > > > > On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
> > > > > >
> > > > > > Drivers that are told to allocate RX buffers from pools of DMA memory
> > > > > > should have enough memory in the pool to satisfy projected allocation
> > > > > > requests (a function of ring size, MTU & other parameters). If there's
> > > > > > not enough memory, RX ring refill might fail later at inconvenient times
> > > > > > (e.g. during NAPI poll).
> > > > > >
> > > > >
> > > > > My understanding is that if the RX ring refill fails, the driver will
> > > > > post the buffers it was able to allocate data for, and will not post
> > > > > other buffers. So it will run with a degraded performance but nothing
> > > > > overly bad should happen. This should be the same behavior if the
> > > > > machine is under memory pressure.
> > > > >
> > > > > In general I don't know about this change. If the user wants to use
> > > > > very small dmabufs, they should be able to, without going through
> > > > > hoops reducing the number of rx ring slots the driver has (if it
> > > > > supports configuring that).
> > > > >
> > > > > I think maybe printing an error or warning that the dmabuf is too
> > > > > small for the pool_size may be fine. But outright failing this
> > > > > configuration? I don't think so.
> > > > >
> > > > > > This commit adds a check at dmabuf pool init time that compares the
> > > > > > amount of memory in the underlying chunk pool (configured by the user
> > > > > > space application providing dmabuf memory) with the desired pool size
> > > > > > (previously set by the driver) and fails with an error message if chunk
> > > > > > memory isn't enough.
> > > > > >
> > > > > > Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> > > > > > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > > > > > ---
> > > > > >  net/core/devmem.c | 11 +++++++++++
> > > > > >  1 file changed, 11 insertions(+)
> > > > > >
> > > > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > > > index 6e27a47d0493..651cd55ebb28 100644
> > > > > > --- a/net/core/devmem.c
> > > > > > +++ b/net/core/devmem.c
> > > > > > @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > > > > >  int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > > >  {
> > > > > >         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > > > > > +       size_t size;
> > > > > >
> > > > > >         if (!binding)
> > > > > >                 return -EINVAL;
> > > > > > @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > > >         if (pool->p.order != 0)
> > > > > >                 return -E2BIG;
> > > > > >
> > > > > > +       /* Validate that the underlying dmabuf has enough memory to satisfy
> > > > > > +        * requested pool size.
> > > > > > +        */
> > > > > > +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> > > > > > +       if (size < pool->p.pool_size) {
> > > > >
> > > > > pool_size seems to be the number of ptr_ring slots in the page_pool,
> > > > > not some upper or lower bound on the amount of memory the page_pool
> > > > > can provide. So this check seems useless? The page_pool can still not
> > > > > provide this amount of memory with dmabuf (if the netmems aren't being
> > > > > recycled fast enough) or with normal memory (under memory pressure).
> > > >
> > > > I read this check more as "is there enough chunks in the binding to
> > > > fully fill in the page pool". User controls the size of rx ring
> > >
> > > Only on drivers that support ethtool -G, and where it will let you
> > > configure -G to what you want.
> >
> > gve is the minority here, any major nic (brcm/mlx/intel) supports resizing
> > the rings.
> >
> 
> GVE supports resizing rings. Other drivers may not. Even on drivers
> that support resizing rings, some users may have a use case for a
> dmabuf smaller than the minimum ring size their driver accepts.
> 
> > > > which
> > > > controls the size of the page pool which somewhat dictates the minimal
> > > > size of the binding (maybe).
> > >
> > > See the test I ran in the other thread. Seems at least GVE is fine
> > > with dmabuf size < ring size. I don't know what other drivers do, but
> > > generally speaking I think specific driver limitations should not
> > > limit what others can do with their drivers. Sure for the GPU mem
> > > applications you're probably looking at the dmabufs are huge and
> > > supporting small dmabufs is not a concern, but someone somewhere may
> > > want to run with 1 MB dmabuf for some use case and if their driver is
> > > fine with it, core should not prevent it, I think.
> > >
> > > > So it's more of a sanity check.
> > > >
> > > > Maybe having better defaults in ncdevmem would've been a better option? It
> > > > allocates (16000*4096) bytes (slightly less than 64MB, why? to fit into
> > > > default /sys/module/udmabuf/parameters/size_limit_mb?) and on my setup
> > > > PP wants to get 64MB at least..
> > >
> > > Yeah, udmabuf has a limitation that it only supports 64MB max size
> > > last I looked.
> >
> > We can use /sys/module/udmabuf/parameters/size_limit_mb to allocate
> > more than 64MB, ncdevmem can change it.
> 
> The udmabuf limit is hardcoded, in udmabuf.c or configured on module
> load, and ncdevmem doesn't load udmabuf. I guess it could be changed
> to that, but currently ncdevmem works with CONFIG_UDMABUF=y

You don't need to load/reload the module to change module params:

# id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys)
# cat /sys/module/udmabuf/parameters/size_limit_mb
64
# echo 128 > /sys/module/udmabuf/parameters/size_limit_mb
# cat /sys/module/udmabuf/parameters/size_limit_mb
128

> > Or warn the user similar
> > to what kperf does: https://github.com/facebookexperimental/kperf/blob/main/devmem.c#L308
> >
> > So either having a kernel warn or tuning 63MB up to something sensible
> > (1G?) should prevent people from going through the pain..
> >
> 
> Agreed with both. Another option is updating the devmem.rst docs:
> 
> "Some drivers may struggle to run devmem TCP when the dmabuf size is
> too small, especially when it's smaller than the number of rx ring
> slots. Look for this warning in dmesg." etc.
> 
> But I don't see the need to outright disable this "dmabuf size < ring
> size" use case for everyone to solve this.

Agreed. The fact that mlx5 has issues with a small pp should be fixed,
I'm not arguing with that. I'm trying to understand whether giving
the user a hint about dmabuf < pp size is helpful or not (because it
will most likely lead to poor perf, and perf is the main point of
devmem).
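
For illustration, a warn-only variant of the check could look roughly like
the snippet below. It reuses the same fields and helpers as the hunk in the
posted patch, but only emits a hint instead of failing the bind (a sketch of
the idea under discussion, not submitted code):

	/* Sketch: hint, rather than reject, when the dmabuf cannot fully
	 * fill the page_pool. Mirrors the fields used in the patch above.
	 */
	size_t size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;

	if (size < pool->p.pool_size)
		pr_warn("%s: dmabuf (%zu pages) smaller than pool_size (%u pages), expect refill failures and degraded performance\n",
			__func__, size, pool->p.pool_size);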

> > > I added devmem TCP support with udmabuf selftests to demonstrate that
> > > the feature is non-proprietary, not to advertise that devmem tcp +
> > > udmabuf is a great combination. udmabuf is actually terrible for
> > > devmem TCP. The 64MB limit is way too small for anyone to do anything
> > > performant on it and by dmaing into host memory you lose many of the
> > > benefits of devmem TCP (lower mem bw + pcie bw utilization).
> >
> > It would still be nice to have a udmabuf as a properly supported option.
> > This can drive the UAPI performance conversions: for example, comparing
> > existing tcp rx zerocopy vs MSG_SOCK_DEVMEM.. So let's not completely
> > dismiss it. We've played internally with doing 2MB udmabuf huge-pages,
> > might post it at some point..
> >
> > > If you're running real experiments with devmem TCP I suggest moving to
> > > real dmabufs as soon as possible, or at least hack udmabuf to give you
> > > large sizes. We've open sourced our production devmem TCP userspace:
> > >
> > > https://github.com/google/tcpgpudmarxd
> > > https://github.com/google/nccl-plugin-gpudirecttcpx
> > >
> > > Porting it to upstream APIs + your dmabuf provider will have you run
> > > much more interesting tests than anything you do with udmabuf I think,
> > > unless you hack the udmabuf size.
> >
> > I found these a bit too late, so I reimplemented the plugin over
> > upstream APIs :-[
> 
> Oh, where? Is it open source?

No, but depending on where we end up I think it should be possible
to make it open source. There is no secret sauce in there..

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-24 22:40           ` Stanislav Fomichev
@ 2025-04-24 23:42             ` Mina Almasry
  2025-04-25  0:37               ` Stanislav Fomichev
  0 siblings, 1 reply; 15+ messages in thread
From: Mina Almasry @ 2025-04-24 23:42 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Cosmin Ratiu, netdev, Jason Gunthorpe, Leon Romanovsky,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Saeed Mahameed, Tariq Toukan,
	Dragos Tatulea, linux-kselftest

On Thu, Apr 24, 2025 at 3:40 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 04/24, Mina Almasry wrote:
> > On Thu, Apr 24, 2025 at 3:10 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 04/24, Mina Almasry wrote:
> > > > On Wed, Apr 23, 2025 at 1:15 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > > > >
> > > > > On 04/23, Mina Almasry wrote:
> > > > > > On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
> > > > > > >
> > > > > > > Drivers that are told to allocate RX buffers from pools of DMA memory
> > > > > > > should have enough memory in the pool to satisfy projected allocation
> > > > > > > requests (a function of ring size, MTU & other parameters). If there's
> > > > > > > not enough memory, RX ring refill might fail later at inconvenient times
> > > > > > > (e.g. during NAPI poll).
> > > > > > >
> > > > > >
> > > > > > My understanding is that if the RX ring refill fails, the driver will
> > > > > > post the buffers it was able to allocate data for, and will not post
> > > > > > other buffers. So it will run with a degraded performance but nothing
> > > > > > overly bad should happen. This should be the same behavior if the
> > > > > > machine is under memory pressure.
> > > > > >
> > > > > > In general I don't know about this change. If the user wants to use
> > > > > > very small dmabufs, they should be able to, without going through
> > > > > > hoops reducing the number of rx ring slots the driver has (if it
> > > > > > supports configuring that).
> > > > > >
> > > > > > I think maybe printing an error or warning that the dmabuf is too
> > > > > > small for the pool_size may be fine. But outright failing this
> > > > > > configuration? I don't think so.
> > > > > >
> > > > > > > This commit adds a check at dmabuf pool init time that compares the
> > > > > > > amount of memory in the underlying chunk pool (configured by the user
> > > > > > > space application providing dmabuf memory) with the desired pool size
> > > > > > > (previously set by the driver) and fails with an error message if chunk
> > > > > > > memory isn't enough.
> > > > > > >
> > > > > > > Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> > > > > > > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > > > > > > ---
> > > > > > >  net/core/devmem.c | 11 +++++++++++
> > > > > > >  1 file changed, 11 insertions(+)
> > > > > > >
> > > > > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > > > > index 6e27a47d0493..651cd55ebb28 100644
> > > > > > > --- a/net/core/devmem.c
> > > > > > > +++ b/net/core/devmem.c
> > > > > > > @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > > > > > >  int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > > > >  {
> > > > > > >         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > > > > > > +       size_t size;
> > > > > > >
> > > > > > >         if (!binding)
> > > > > > >                 return -EINVAL;
> > > > > > > @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > > > >         if (pool->p.order != 0)
> > > > > > >                 return -E2BIG;
> > > > > > >
> > > > > > > +       /* Validate that the underlying dmabuf has enough memory to satisfy
> > > > > > > +        * requested pool size.
> > > > > > > +        */
> > > > > > > +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> > > > > > > +       if (size < pool->p.pool_size) {
> > > > > >
> > > > > > pool_size seems to be the number of ptr_ring slots in the page_pool,
> > > > > > not some upper or lower bound on the amount of memory the page_pool
> > > > > > can provide. So this check seems useless? The page_pool can still not
> > > > > > provide this amount of memory with dmabuf (if the netmems aren't being
> > > > > > recycled fast enough) or with normal memory (under memory pressure).
> > > > >
> > > > > I read this check more as "is there enough chunks in the binding to
> > > > > fully fill in the page pool". User controls the size of rx ring
> > > >
> > > > Only on drivers that support ethtool -G, and where it will let you
> > > > configure -G to what you want.
> > >
> > > gve is the minority here, any major nic (brcm/mlx/intel) supports resizing
> > > the rings.
> > >
> >
> > GVE supports resizing rings. Other drivers may not. Even on drivers
> > that support resizing rings, some users may have a use case for a
> > dmabuf smaller than the minimum ring size their driver accepts.
> >
> > > > > which
> > > > > controls the size of the page pool which somewhat dictates the minimal
> > > > > size of the binding (maybe).
> > > >
> > > > See the test I ran in the other thread. Seems at least GVE is fine
> > > > with dmabuf size < ring size. I don't know what other drivers do, but
> > > > generally speaking I think specific driver limitations should not
> > > > limit what others can do with their drivers. Sure for the GPU mem
> > > > applications you're probably looking at the dmabufs are huge and
> > > > supporting small dmabufs is not a concern, but someone somewhere may
> > > > want to run with 1 MB dmabuf for some use case and if their driver is
> > > > fine with it, core should not prevent it, I think.
> > > >
> > > > > So it's more of a sanity check.
> > > > >
> > > > > Maybe having better defaults in ncdevmem would've been a better option? It
> > > > > allocates (16000*4096) bytes (slightly less than 64MB, why? to fit into
> > > > > default /sys/module/udmabuf/parameters/size_limit_mb?) and on my setup
> > > > > PP wants to get 64MB at least..
> > > >
> > > > Yeah, udmabuf has a limitation that it only supports 64MB max size
> > > > last I looked.
> > >
> > > We can use /sys/module/udmabuf/parameters/size_limit_mb to allocate
> > > more than 64MB, ncdevmem can change it.
> >
> > The udmabuf limit is hardcoded, in udmabuf.c or configured on module
> > load, and ncdevmem doesn't load udmabuf. I guess it could be changed
> > to that, but currently ncdevmem works with CONFIG_UDMABUF=y
>
> You don't need to load/reload the module to change module params:
>
> # id
> uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys)
> # cat /sys/module/udmabuf/parameters/size_limit_mb
> 64
> # echo 128 > /sys/module/udmabuf/parameters/size_limit_mb
> # cat /sys/module/udmabuf/parameters/size_limit_mb
> 128
>

Today I learned! Thanks!

I will put it on my todo list to make ncdevmem force a larger limit to
udmabuf and allocate a larger dmabuf by default. Technically any
dmabuf size should be supported, but by default ncdevmem should
probably use a 0.5 GB to 1 GB dmabuf, which is more typical for these
GPU applications. There could also be an option for folks to test
their driver against smaller dmabufs.
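
As a rough illustration of that todo item, the helper below is one way
ncdevmem could bump the module limit from userspace before allocating the
buffer. The function name, the error handling and any default value are
made up for the sketch; only the sysfs path comes from the discussion above:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical ncdevmem helper: raise udmabuf's size_limit_mb so a larger
 * dmabuf (e.g. 1 GB) can be allocated. Illustrative sketch only.
 */
static int udmabuf_set_size_limit_mb(unsigned int mb)
{
	const char path[] = "/sys/module/udmabuf/parameters/size_limit_mb";
	char buf[16];
	int fd, len, ret = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	len = snprintf(buf, sizeof(buf), "%u", mb);
	if (write(fd, buf, len) != len)
		ret = -errno;

	close(fd);
	return ret;
}

Something along these lines, called before the udmabuf allocation (and
requiring root, as in the sysfs example quoted earlier), would let the
selftest default to a larger buffer while keeping an option for smaller
ones.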

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools
  2025-04-24 23:42             ` Mina Almasry
@ 2025-04-25  0:37               ` Stanislav Fomichev
  0 siblings, 0 replies; 15+ messages in thread
From: Stanislav Fomichev @ 2025-04-25  0:37 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Cosmin Ratiu, netdev, Jason Gunthorpe, Leon Romanovsky,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Saeed Mahameed, Tariq Toukan,
	Dragos Tatulea, linux-kselftest

On 04/24, Mina Almasry wrote:
> On Thu, Apr 24, 2025 at 3:40 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 04/24, Mina Almasry wrote:
> > > On Thu, Apr 24, 2025 at 3:10 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > > >
> > > > On 04/24, Mina Almasry wrote:
> > > > > On Wed, Apr 23, 2025 at 1:15 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > > > > >
> > > > > > On 04/23, Mina Almasry wrote:
> > > > > > > On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > Drivers that are told to allocate RX buffers from pools of DMA memory
> > > > > > > > should have enough memory in the pool to satisfy projected allocation
> > > > > > > > requests (a function of ring size, MTU & other parameters). If there's
> > > > > > > > not enough memory, RX ring refill might fail later at inconvenient times
> > > > > > > > (e.g. during NAPI poll).
> > > > > > > >
> > > > > > >
> > > > > > > My understanding is that if the RX ring refill fails, the driver will
> > > > > > > post the buffers it was able to allocate data for, and will not post
> > > > > > > other buffers. So it will run with a degraded performance but nothing
> > > > > > > overly bad should happen. This should be the same behavior if the
> > > > > > > machine is under memory pressure.
> > > > > > >
> > > > > > > In general I don't know about this change. If the user wants to use
> > > > > > > very small dmabufs, they should be able to, without going through
> > > > > > > hoops reducing the number of rx ring slots the driver has (if it
> > > > > > > supports configuring that).
> > > > > > >
> > > > > > > I think maybe printing an error or warning that the dmabuf is too
> > > > > > > small for the pool_size may be fine. But outright failing this
> > > > > > > configuration? I don't think so.
> > > > > > >
> > > > > > > > This commit adds a check at dmabuf pool init time that compares the
> > > > > > > > amount of memory in the underlying chunk pool (configured by the user
> > > > > > > > space application providing dmabuf memory) with the desired pool size
> > > > > > > > (previously set by the driver) and fails with an error message if chunk
> > > > > > > > memory isn't enough.
> > > > > > > >
> > > > > > > > Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> > > > > > > > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > > > > > > > ---
> > > > > > > >  net/core/devmem.c | 11 +++++++++++
> > > > > > > >  1 file changed, 11 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > > > > > index 6e27a47d0493..651cd55ebb28 100644
> > > > > > > > --- a/net/core/devmem.c
> > > > > > > > +++ b/net/core/devmem.c
> > > > > > > > @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > > > > > > >  int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > > > > >  {
> > > > > > > >         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > > > > > > > +       size_t size;
> > > > > > > >
> > > > > > > >         if (!binding)
> > > > > > > >                 return -EINVAL;
> > > > > > > > @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > > > > >         if (pool->p.order != 0)
> > > > > > > >                 return -E2BIG;
> > > > > > > >
> > > > > > > > +       /* Validate that the underlying dmabuf has enough memory to satisfy
> > > > > > > > +        * requested pool size.
> > > > > > > > +        */
> > > > > > > > +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> > > > > > > > +       if (size < pool->p.pool_size) {
> > > > > > >
> > > > > > > pool_size seems to be the number of ptr_ring slots in the page_pool,
> > > > > > > not some upper or lower bound on the amount of memory the page_pool
> > > > > > > can provide. So this check seems useless? The page_pool can still not
> > > > > > > provide this amount of memory with dmabuf (if the netmems aren't being
> > > > > > > recycled fast enough) or with normal memory (under memory pressure).
> > > > > >
> > > > > > I read this check more as "is there enough chunks in the binding to
> > > > > > fully fill in the page pool". User controls the size of rx ring
> > > > >
> > > > > Only on drivers that support ethtool -G, and where it will let you
> > > > > configure -G to what you want.
> > > >
> > > > gve is the minority here, any major nic (brcm/mlx/intel) supports resizing
> > > > the rings.
> > > >
> > >
> > > GVE supports resizing rings. Other drivers may not. Even on drivers
> > > that support resizing rings, some users may have a use case for a
> > > dmabuf smaller than the minimum ring size their driver accepts.
> > >
> > > > > > which
> > > > > > controls the size of the page pool which somewhat dictates the minimal
> > > > > > size of the binding (maybe).
> > > > >
> > > > > See the test I ran in the other thread. Seems at least GVE is fine
> > > > > with dmabuf size < ring size. I don't know what other drivers do, but
> > > > > generally speaking I think specific driver limitations should not
> > > > > limit what others can do with their drivers. Sure for the GPU mem
> > > > > applications you're probably looking at the dmabufs are huge and
> > > > > supporting small dmabufs is not a concern, but someone somewhere may
> > > > > want to run with 1 MB dmabuf for some use case and if their driver is
> > > > > fine with it, core should not prevent it, I think.
> > > > >
> > > > > > So it's more of a sanity check.
> > > > > >
> > > > > > Maybe having better defaults in ncdevmem would've been a better option? It
> > > > > > allocates (16000*4096) bytes (slightly less than 64MB, why? to fit into
> > > > > > default /sys/module/udmabuf/parameters/size_limit_mb?) and on my setup
> > > > > > PP wants to get 64MB at least..
> > > > >
> > > > > Yeah, udmabuf has a limitation that it only supports 64MB max size
> > > > > last I looked.
> > > >
> > > > We can use /sys/module/udmabuf/parameters/size_limit_mb to allocate
> > > > more than 64MB, ncdevmem can change it.
> > >
> > > The udmabuf limit is hardcoded, in udmabuf.c or configured on module
> > > load, and ncdevmem doesn't load udmabuf. I guess it could be changed
> > > to that, but currently ncdevmem works with CONFIG_UDMABUF=y
> >
> > You don't need to load/reload the module to change module params:
> >
> > # id
> > uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys)
> > # cat /sys/module/udmabuf/parameters/size_limit_mb
> > 64
> > # echo 128 > /sys/module/udmabuf/parameters/size_limit_mb
> > # cat /sys/module/udmabuf/parameters/size_limit_mb
> > 128
> >
> 
> Today I learned! Thanks!
> 
> I will put it on my todo list to make ncdevmem force a larger limit to

Or we can ask Cosmin to send something out? Since he's already looking
into the buffer sizes..

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-04-25  0:37 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-23 15:35 [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools Cosmin Ratiu
2025-04-23 15:35 ` [PATCH net 2/2] tests/ncdevmem: Fix double-free of queue array Cosmin Ratiu
2025-04-23 16:54   ` Stanislav Fomichev
2025-04-23 18:49   ` Mina Almasry
2025-04-23 16:53 ` [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools Stanislav Fomichev
2025-04-23 17:30 ` Mina Almasry
2025-04-23 20:15   ` Stanislav Fomichev
2025-04-24 20:57     ` Mina Almasry
2025-04-24 22:10       ` Stanislav Fomichev
2025-04-24 22:26         ` Mina Almasry
2025-04-24 22:40           ` Stanislav Fomichev
2025-04-24 23:42             ` Mina Almasry
2025-04-25  0:37               ` Stanislav Fomichev
2025-04-24  8:47   ` Cosmin Ratiu
2025-04-24 20:50     ` Mina Almasry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).