* [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
@ 2026-01-16 5:02 Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 1/5] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
` (5 more replies)
0 siblings, 6 replies; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-16 5:02 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter
Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman
This series reduces the CPU cost of RX token management by adding an
attribute to NETDEV_CMD_BIND_RX that configures sockets using the
binding to avoid the xarray allocator, using instead a per-binding niov
array and a uref field in struct net_iov.
The improvement is ~13% CPU utilization per RX user thread.
Using kperf, the following results were observed:
Before:
Average RX worker idle %: 13.13, flows 4, test runs 11
After:
Average RX worker idle %: 26.32, flows 4, test runs 11
Two other approaches were tested, but with no improvement. Namely, 1)
using a hashmap for tokens and 2) keeping an xarray of atomic counters
but using RCU so that the hotpath could be mostly lockless. Neither of
these approaches proved better than the simple array in terms of CPU.
The attribute NETDEV_A_DMABUF_AUTORELEASE is added to toggle the
optimization. The attribute is optional and defaults to 0 (autorelease
off, i.e., the optimization enabled).
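As a schematic illustration (userspace Python, not kernel code; names are
loosely modeled on the kernel structures in this series), the difference
between the two token-management schemes:

```python
# Schematic model of the two RX token-management schemes. Names are
# loosely based on the kernel structures; this is not kernel code.

class XarrayMode:
    """Old scheme: each received frag allocates a fresh token in a
    per-socket map (sk->sk_user_frags in the kernel)."""
    def __init__(self):
        self.frags = {}
        self.next_token = 0

    def recv_frag(self, niov):
        tok = self.next_token          # xa_alloc() per frag
        self.next_token += 1
        self.frags[tok] = niov
        return tok

    def dontneed(self, tok):
        return self.frags.pop(tok)     # xa_erase() per token

class UrefMode:
    """New scheme: the token is the niov index into the per-binding
    array (binding->vec); only a per-niov counter (niov->uref) moves."""
    def __init__(self, nr_niovs):
        self.uref = [0] * nr_niovs

    def recv_frag(self, niov_idx):
        self.uref[niov_idx] += 1       # no allocator on the hot path
        return niov_idx

    def dontneed(self, tok):
        assert self.uref[tok] > 0
        self.uref[tok] -= 1
```

The hot-path saving comes from replacing a per-frag allocator insert/erase
with a single counter increment/decrement on preallocated state.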
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Changes in v10:
- add new tests for edge cases
- add new binding->users field for tracking socket/rxq users
- remove rx binding count (use xarray instead)
- Link to v9: https://lore.kernel.org/r/20260109-scratch-bobbyeshleman-devmem-tcp-token-upstream-v9-0-8042930d00d7@meta.com
Changes in v9:
- fixed build with NET_DEVMEM=n
- fixed bug in rx bindings count logic
- Link to v8: https://lore.kernel.org/r/20260107-scratch-bobbyeshleman-devmem-tcp-token-upstream-v8-0-92c968631496@meta.com
Changes in v8:
- change static branch logic (only set when enabled; otherwise always
revert back to disabled)
- fix missing tests
- Link to v7: https://lore.kernel.org/r/20251119-scratch-bobbyeshleman-devmem-tcp-token-upstream-v7-0-1abc8467354c@meta.com
Changes in v7:
- use netlink instead of sockopt (Stan)
- restrict system to only one mode, dmabuf bindings can not co-exist
with different modes (Stan)
- use static branching to enforce single system-wide mode (Stan)
- Link to v6: https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-upstream-v6-0-ea98cf4d40b3@meta.com
Changes in v6:
- renamed 'net: devmem: use niov array for token management' to reflect
the optionality of the new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding
field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-upstream-v5-0-47cb85f5259e@meta.com
Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v4-0-39156563c3ea@meta.com
Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v3-0-084b46bda88f@meta.com
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
footprint
- fallback to cleaning up references in dmabuf unbind if socket
leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com
Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule
(Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-upstream-v1-0-d946169b5550@meta.com
---
Bobby Eshleman (5):
net: devmem: rename tx_vec to vec in dmabuf binding
net: devmem: refactor sock_devmem_dontneed for autorelease split
net: devmem: implement autorelease token management
net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
selftests: drv-net: devmem: add autorelease tests
Documentation/netlink/specs/netdev.yaml | 12 ++
Documentation/networking/devmem.rst | 73 +++++++++++
include/net/netmem.h | 1 +
include/net/sock.h | 7 +-
include/uapi/linux/netdev.h | 1 +
net/core/devmem.c | 148 ++++++++++++++++++----
net/core/devmem.h | 66 +++++++++-
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 10 +-
net/core/sock.c | 103 +++++++++++----
net/ipv4/tcp.c | 87 ++++++++++---
net/ipv4/tcp_ipv4.c | 15 ++-
net/ipv4/tcp_minisocks.c | 3 +-
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/drivers/net/hw/devmem.py | 98 +++++++++++++-
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 68 +++++++++-
16 files changed, 611 insertions(+), 87 deletions(-)
---
base-commit: d4596891e72cbf155d61798a81ce9d36b69bfaf4
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503
Best regards,
--
Bobby Eshleman <bobbyeshleman@meta.com>
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH net-next v10 1/5] net: devmem: rename tx_vec to vec in dmabuf binding
2026-01-16 5:02 [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
@ 2026-01-16 5:02 ` Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 2/5] net: devmem: refactor sock_devmem_dontneed for autorelease split Bobby Eshleman
` (4 subsequent siblings)
5 siblings, 0 replies; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-16 5:02 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter
Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman
From: Bobby Eshleman <bobbyeshleman@meta.com>
Rename the 'tx_vec' field in struct net_devmem_dmabuf_binding to 'vec'.
This field holds pointers to net_iov structures. The rename prepares for
reusing 'vec' for both TX and RX directions.
No functional change intended.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
net/core/devmem.c | 22 +++++++++++-----------
net/core/devmem.h | 2 +-
2 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 185ed2a73d1c..9dee697a28ee 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -85,7 +85,7 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
dma_buf_put(binding->dmabuf);
xa_destroy(&binding->bound_rxqs);
percpu_ref_exit(&binding->ref);
- kvfree(binding->tx_vec);
+ kvfree(binding->vec);
kfree(binding);
}
@@ -246,10 +246,10 @@ net_devmem_bind_dmabuf(struct net_device *dev,
}
if (direction == DMA_TO_DEVICE) {
- binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
- sizeof(struct net_iov *),
- GFP_KERNEL);
- if (!binding->tx_vec) {
+ binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+ sizeof(struct net_iov *),
+ GFP_KERNEL);
+ if (!binding->vec) {
err = -ENOMEM;
goto err_unmap;
}
@@ -263,7 +263,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
dev_to_node(&dev->dev));
if (!binding->chunk_pool) {
err = -ENOMEM;
- goto err_tx_vec;
+ goto err_vec;
}
virtual = 0;
@@ -309,7 +309,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
net_devmem_get_dma_addr(niov));
if (direction == DMA_TO_DEVICE)
- binding->tx_vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+ binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
}
virtual += len;
@@ -329,8 +329,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
gen_pool_for_each_chunk(binding->chunk_pool,
net_devmem_dmabuf_free_chunk_owner, NULL);
gen_pool_destroy(binding->chunk_pool);
-err_tx_vec:
- kvfree(binding->tx_vec);
+err_vec:
+ kvfree(binding->vec);
err_unmap:
dma_buf_unmap_attachment_unlocked(binding->attachment, binding->sgt,
direction);
@@ -379,7 +379,7 @@ struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk,
int err = 0;
binding = net_devmem_lookup_dmabuf(dmabuf_id);
- if (!binding || !binding->tx_vec) {
+ if (!binding || !binding->vec) {
err = -EINVAL;
goto out_err;
}
@@ -430,7 +430,7 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding,
*off = virt_addr % PAGE_SIZE;
*size = PAGE_SIZE - *off;
- return binding->tx_vec[virt_addr / PAGE_SIZE];
+ return binding->vec[virt_addr / PAGE_SIZE];
}
/*** "Dmabuf devmem memory provider" ***/
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 2534c8144212..94874b323520 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -63,7 +63,7 @@ struct net_devmem_dmabuf_binding {
* address. This array is convenient to map the virtual addresses to
* net_iovs in the TX path.
*/
- struct net_iov **tx_vec;
+ struct net_iov **vec;
struct work_struct unbind_w;
};
--
2.47.3
* [PATCH net-next v10 2/5] net: devmem: refactor sock_devmem_dontneed for autorelease split
2026-01-16 5:02 [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 1/5] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
@ 2026-01-16 5:02 ` Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 3/5] net: devmem: implement autorelease token management Bobby Eshleman
` (3 subsequent siblings)
5 siblings, 0 replies; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-16 5:02 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter
Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman
From: Bobby Eshleman <bobbyeshleman@meta.com>
Refactor sock_devmem_dontneed() in preparation for supporting both
autorelease and manual token release modes.
Split the function into two parts:
- sock_devmem_dontneed(): handles input validation, token allocation,
and copying from userspace
- sock_devmem_dontneed_autorelease(): performs the actual token release
via xarray lookup and page pool put
This separation allows a future commit to add a parallel
sock_devmem_dontneed_manual_release() function that uses a different
token tracking mechanism (per-niov reference counting) without
duplicating the input validation logic.
The refactoring is purely mechanical, with no functional change; it is
only intended to minimize the noise in subsequent patches.
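The resulting shape can be sketched as follows (illustrative Python, not
the kernel C; the real code validates optlen alignment against struct
dmabuf_token and copies from userspace):

```python
# Illustrative sketch of the split: a thin entry point validates and
# copies the token list, then delegates to a mode-specific helper.

MAX_DONTNEED_TOKENS = 1024
EINVAL, EBADF = 22, 9

def dontneed_autorelease(frags, tokens):
    # the actual release path: drop each token from the per-socket map
    released = 0
    for tok in tokens:
        if frags.pop(tok, None) is not None:
            released += 1
    return released

def dontneed(sock, frags, tokens):
    # validation stays in one place, shared by both (future) modes
    if not sock.get("is_tcp"):
        return -EBADF
    if not tokens or len(tokens) > MAX_DONTNEED_TOKENS:
        return -EINVAL
    return dontneed_autorelease(frags, list(tokens))
```

A later patch can then add a second helper alongside
dontneed_autorelease() without duplicating the validation.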
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
net/core/sock.c | 52 ++++++++++++++++++++++++++++++++--------------------
1 file changed, 32 insertions(+), 20 deletions(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index a1c8b47b0d56..f6526f43aa6e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1082,30 +1082,13 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
#define MAX_DONTNEED_FRAGS 1024
static noinline_for_stack int
-sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
+sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
+ unsigned int num_tokens)
{
- unsigned int num_tokens, i, j, k, netmem_num = 0;
- struct dmabuf_token *tokens;
+ unsigned int i, j, k, netmem_num = 0;
int ret = 0, num_frags = 0;
netmem_ref netmems[16];
- if (!sk_is_tcp(sk))
- return -EBADF;
-
- if (optlen % sizeof(*tokens) ||
- optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
- return -EINVAL;
-
- num_tokens = optlen / sizeof(*tokens);
- tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
- if (!tokens)
- return -ENOMEM;
-
- if (copy_from_sockptr(tokens, optval, optlen)) {
- kvfree(tokens);
- return -EFAULT;
- }
-
xa_lock_bh(&sk->sk_user_frags);
for (i = 0; i < num_tokens; i++) {
for (j = 0; j < tokens[i].token_count; j++) {
@@ -1135,6 +1118,35 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
for (k = 0; k < netmem_num; k++)
WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
+ return ret;
+}
+
+static noinline_for_stack int
+sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
+{
+ struct dmabuf_token *tokens;
+ unsigned int num_tokens;
+ int ret;
+
+ if (!sk_is_tcp(sk))
+ return -EBADF;
+
+ if (optlen % sizeof(*tokens) ||
+ optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
+ return -EINVAL;
+
+ num_tokens = optlen / sizeof(*tokens);
+ tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
+ if (!tokens)
+ return -ENOMEM;
+
+ if (copy_from_sockptr(tokens, optval, optlen)) {
+ kvfree(tokens);
+ return -EFAULT;
+ }
+
+ ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
+
kvfree(tokens);
return ret;
}
--
2.47.3
* [PATCH net-next v10 3/5] net: devmem: implement autorelease token management
2026-01-16 5:02 [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 1/5] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 2/5] net: devmem: refactor sock_devmem_dontneed for autorelease split Bobby Eshleman
@ 2026-01-16 5:02 ` Bobby Eshleman
2026-01-21 1:00 ` Jakub Kicinski
2026-01-22 4:15 ` Mina Almasry
2026-01-16 5:02 ` [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute Bobby Eshleman
` (2 subsequent siblings)
5 siblings, 2 replies; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-16 5:02 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter
Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add support for toggling token autorelease, using a static branch to
control the system-wide behavior. This allows applications to choose
between two memory-management modes:
1. Autorelease on: Leaked tokens are automatically released when the
socket closes.
2. Autorelease off: Leaked tokens are released during dmabuf unbind.
The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE
attribute of the NETDEV_CMD_BIND_RX message. Mixing modes across
bindings is disallowed and is rejected by netlink: the system is
"locked" into the mode of the first binding, and the mode can only
change again once there are zero RX bindings on the system.
Disabling autorelease offers a ~13% improvement in CPU utilization.
Static branching is used to limit the system to one mode or the other.
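The mode-locking rule can be modeled as follows (schematic Python; in the
kernel this is the tcp_devmem_ar_key static branch guarded by
devmem_ar_lock, and only DMA_FROM_DEVICE bindings participate):

```python
# Schematic model of the system-wide autorelease mode lock: the first
# RX binding fixes the mode, and it only resets at zero RX bindings.

EBUSY = 16

class System:
    def __init__(self):
        self.rx_bindings = 0
        self.autorelease = False   # static key, disabled by default

    def bind_rx(self, autorelease):
        if self.rx_bindings and self.autorelease != autorelease:
            return -EBUSY          # mode already locked by a binding
        self.autorelease = autorelease
        self.rx_bindings += 1
        return 0

    def free_binding(self):
        self.rx_bindings -= 1
        if self.rx_bindings == 0:
            self.autorelease = False   # key reverts to disabled
```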
The xa_erase(&net_devmem_dmabuf_bindings, ...) call is moved into
__net_devmem_dmabuf_binding_free(...), which makes it possible to
switch the static branch atomically with respect to the xarray state.
In the window between unbind and free, the socket layer can still find
the binding in the xarray, but it will fail to acquire binding->ref
(if unbind decremented it to zero). This change preserves correct
behavior and avoids more complicated counting schemes for bindings.
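The unbind-to-free window can be sketched as follows (schematic Python
model of percpu_ref semantics, not kernel code):

```python
# Schematic model: between unbind and free, the binding is still
# visible in the xarray, but its percpu ref can no longer be taken.

class Binding:
    def __init__(self):
        self.live = True   # percpu_ref state
        self.refs = 1

    def kill(self):        # percpu_ref_kill() runs at unbind
        self.live = False
        self.refs -= 1     # drops the initial reference

    def tryget(self):      # like percpu_ref_tryget_live()
        if not self.live:
            return False
        self.refs += 1
        return True

bindings = {7: Binding()}      # stands in for the xarray
bindings[7].kill()             # unbind ran; free (and xa_erase) has not

found = bindings.get(7)        # socket layer can still find it...
assert found is not None
assert found.tryget() is False # ...but fails to acquire binding->ref
```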
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v10:
- add binding->users to track socket and rxq users of binding, defer
release of urefs until binding->users hits zero to guard users from
incrementing urefs *after* net_devmem_dmabuf_binding_put_urefs()
is called. (Mina)
- fix failure to restore static key state when xarray alloc fails
(Jakub)
- add wrappers for setting/unsetting the mode that capture the static
key + rx binding count logic.
- move xa_erase() into __net_devmem_dmabuf_binding_free()
- remove net_devmem_rx_bindings_count; change xarray management to
avoid the same race that the count had
- check return of net_devmem_dmabuf_binding_get() in
tcp_recvmsg_dmabuf()
- move sk_devmem_info.binding fiddling into autorelease=off static path
Changes in v9:
- Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n
- Add wrapper around tcp_devmem_ar_key accesses so that it may be
stubbed out when NET_DEVMEM=n
- only dec rx binding count for rx bindings in free (v8 did not exclude
TX bindings)
Changes in v8:
- Only reset static key when bindings go to zero, defaulting back to
disabled (Stan).
- Fix bad usage of xarray spinlock for sleepy static branch switching,
use mutex instead.
- Access pp_ref_count via niov->desc instead of niov directly.
- Move reset of static key to __net_devmem_dmabuf_binding_free() so that
the static key can not be changed while there are outstanding tokens
(free is only called when reference count reaches zero).
- Add net_devmem_dmabuf_rx_bindings_count because tokens may be active
even after xa_erase(), so static key changes must wait until all
RX bindings are finally freed (not just when xarray is empty). A
counter is a simple way to track this.
- socket takes reference on the binding, to avoid use-after-free on
sk_devmem_info.binding in the case that user releases all tokens,
unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token).
- removed some comments that were unnecessary
Changes in v7:
- implement autorelease with static branch (Stan)
- use netlink instead of sockopt (Stan)
- merge uAPI and implementation patches into one patch (seemed less
confusing)
Changes in v6:
- remove sk_devmem_info.autorelease, using binding->autorelease instead
- move binding->autorelease check to outside of
net_devmem_dmabuf_binding_put_urefs() (Mina)
- remove overly defensive net_is_devmem_iov() (Mina)
- add comment about multiple urefs mapping to a single netmem ref (Mina)
- remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina)
- use niov without casting back and forth with netmem (Mina)
- move the autorelease flag from per-binding to per-socket (Mina)
- remove the batching logic in sock_devmem_dontneed_manual_release()
(Mina)
- move autorelease check inside tcp_xa_pool_commit() (Mina)
- remove single-binding restriction for autorelease mode (Mina)
- unbind always checks for leaked urefs
Changes in v5:
- remove unused variables
- introduce autorelease flag, preparing for future patch toggle new
behavior
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
footprint
- fallback to cleaning up references in dmabuf unbind if socket leaked
tokens
- drop ethtool patch
Changes in v2:
- always use GFP_ZERO for binding->vec (Mina)
- remove WARN for changed binding (Mina)
- remove extraneous binding ref get (Mina)
- remove WARNs on invalid user input (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- fix length of alloc for urefs
---
Documentation/netlink/specs/netdev.yaml | 12 +++
include/net/netmem.h | 1 +
include/net/sock.h | 7 +-
include/uapi/linux/netdev.h | 1 +
net/core/devmem.c | 136 +++++++++++++++++++++++++++-----
net/core/devmem.h | 64 ++++++++++++++-
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 10 ++-
net/core/sock.c | 57 +++++++++++--
net/ipv4/tcp.c | 87 ++++++++++++++++----
net/ipv4/tcp_ipv4.c | 15 +++-
net/ipv4/tcp_minisocks.c | 3 +-
tools/include/uapi/linux/netdev.h | 1 +
13 files changed, 345 insertions(+), 54 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 596c306ce52b..a5301b150663 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -562,6 +562,17 @@ attribute-sets:
type: u32
checks:
min: 1
+ -
+ name: autorelease
+ doc: |
+ Token autorelease mode. If true (1), leaked tokens are automatically
+ released when the socket closes. If false (0), leaked tokens are only
+ released when the dmabuf is torn down. Once a binding is created with
+ a specific mode, all subsequent bindings system-wide must use the
+ same mode.
+
+ Optional. Defaults to false if not specified.
+ type: u8
operations:
list:
@@ -769,6 +780,7 @@ operations:
- ifindex
- fd
- queues
+ - autorelease
reply:
attributes:
- id
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 9e10f4ac50c3..80d2263ba4ed 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -112,6 +112,7 @@ struct net_iov {
};
struct net_iov_area *owner;
enum net_iov_type type;
+ atomic_t uref;
};
struct net_iov_area {
diff --git a/include/net/sock.h b/include/net/sock.h
index aafe8bdb2c0f..9d3d5bde15e9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -352,7 +352,7 @@ struct sk_filter;
* @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
* @sk_scm_unused: unused flags for scm_recv()
* @ns_tracker: tracker for netns reference
- * @sk_user_frags: xarray of pages the user is holding a reference on.
+ * @sk_devmem_info: the devmem binding information for the socket
* @sk_owner: reference to the real owner of the socket that calls
* sock_lock_init_class_and_name().
*/
@@ -584,7 +584,10 @@ struct sock {
struct numa_drop_counters *sk_drop_counters;
struct rcu_head sk_rcu;
netns_tracker ns_tracker;
- struct xarray sk_user_frags;
+ struct {
+ struct xarray frags;
+ struct net_devmem_dmabuf_binding *binding;
+ } sk_devmem_info;
#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
struct module *sk_owner;
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index e0b579a1df4f..1e5c209cb998 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -207,6 +207,7 @@ enum {
NETDEV_A_DMABUF_QUEUES,
NETDEV_A_DMABUF_FD,
NETDEV_A_DMABUF_ID,
+ NETDEV_A_DMABUF_AUTORELEASE,
__NETDEV_A_DMABUF_MAX,
NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 9dee697a28ee..1264d8ee40e3 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -11,6 +11,7 @@
#include <linux/genalloc.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
+#include <linux/skbuff_ref.h>
#include <linux/types.h>
#include <net/netdev_queues.h>
#include <net/netdev_rx_queue.h>
@@ -27,6 +28,9 @@
/* Device memory support */
static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
+static DEFINE_MUTEX(devmem_ar_lock);
+DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
+EXPORT_SYMBOL(tcp_devmem_ar_key);
static const struct memory_provider_ops dmabuf_devmem_ops;
@@ -63,12 +67,71 @@ static void net_devmem_dmabuf_binding_release(struct percpu_ref *ref)
schedule_work(&binding->unbind_w);
}
+static bool net_devmem_has_rx_bindings(void)
+{
+ struct net_devmem_dmabuf_binding *binding;
+ unsigned long index;
+
+ lockdep_assert_held(&devmem_ar_lock);
+
+ xa_for_each(&net_devmem_dmabuf_bindings, index, binding) {
+ if (binding->direction == DMA_FROM_DEVICE)
+ return true;
+ }
+ return false;
+}
+
+/* caller must hold devmem_ar_lock */
+static int
+__net_devmem_dmabuf_binding_set_mode(enum dma_data_direction direction,
+ bool autorelease)
+{
+ lockdep_assert_held(&devmem_ar_lock);
+
+ if (direction != DMA_FROM_DEVICE)
+ return 0;
+
+ if (net_devmem_has_rx_bindings() &&
+ static_key_enabled(&tcp_devmem_ar_key) != autorelease)
+ return -EBUSY;
+
+ if (autorelease)
+ static_branch_enable(&tcp_devmem_ar_key);
+
+ return 0;
+}
+
+/* caller must hold devmem_ar_lock */
+static void
+__net_devmem_dmabuf_binding_unset_mode(enum dma_data_direction direction)
+{
+ lockdep_assert_held(&devmem_ar_lock);
+
+ if (direction != DMA_FROM_DEVICE)
+ return;
+
+ if (net_devmem_has_rx_bindings())
+ return;
+
+ static_branch_disable(&tcp_devmem_ar_key);
+}
+
void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
{
struct net_devmem_dmabuf_binding *binding = container_of(wq, typeof(*binding), unbind_w);
size_t size, avail;
+ mutex_lock(&devmem_ar_lock);
+ xa_erase(&net_devmem_dmabuf_bindings, binding->id);
+ __net_devmem_dmabuf_binding_unset_mode(binding->direction);
+ mutex_unlock(&devmem_ar_lock);
+
+ /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
+ * erase.
+ */
+ synchronize_net();
+
gen_pool_for_each_chunk(binding->chunk_pool,
net_devmem_dmabuf_free_chunk_owner, NULL);
@@ -126,19 +189,30 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
}
+void
+net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
+{
+ int i;
+
+ for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
+ struct net_iov *niov;
+ netmem_ref netmem;
+
+ niov = binding->vec[i];
+ netmem = net_iov_to_netmem(niov);
+
+ /* Multiple urefs map to only a single netmem ref. */
+ if (atomic_xchg(&niov->uref, 0) > 0)
+ WARN_ON_ONCE(!napi_pp_put_page(netmem));
+ }
+}
+
void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
{
struct netdev_rx_queue *rxq;
unsigned long xa_idx;
unsigned int rxq_idx;
- xa_erase(&net_devmem_dmabuf_bindings, binding->id);
-
- /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
- * erase.
- */
- synchronize_net();
-
if (binding->list.next)
list_del(&binding->list);
@@ -151,6 +225,8 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
rxq_idx = get_netdev_rx_queue_index(rxq);
__net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
+
+ net_devmem_dmabuf_binding_user_put(binding);
}
percpu_ref_kill(&binding->ref);
@@ -178,6 +254,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
if (err)
goto err_close_rxq;
+ atomic_inc(&binding->users);
+
return 0;
err_close_rxq:
@@ -189,8 +267,10 @@ struct net_devmem_dmabuf_binding *
net_devmem_bind_dmabuf(struct net_device *dev,
struct device *dma_dev,
enum dma_data_direction direction,
- unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
- struct netlink_ext_ack *extack)
+ unsigned int dmabuf_fd,
+ struct netdev_nl_sock *priv,
+ struct netlink_ext_ack *extack,
+ bool autorelease)
{
struct net_devmem_dmabuf_binding *binding;
static u32 id_alloc_next;
@@ -225,6 +305,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
if (err < 0)
goto err_free_binding;
+ atomic_set(&binding->users, 0);
+
mutex_init(&binding->lock);
binding->dmabuf = dmabuf;
@@ -245,14 +327,12 @@ net_devmem_bind_dmabuf(struct net_device *dev,
goto err_detach;
}
- if (direction == DMA_TO_DEVICE) {
- binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
- sizeof(struct net_iov *),
- GFP_KERNEL);
- if (!binding->vec) {
- err = -ENOMEM;
- goto err_unmap;
- }
+ binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+ sizeof(struct net_iov *),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!binding->vec) {
+ err = -ENOMEM;
+ goto err_unmap;
}
/* For simplicity we expect to make PAGE_SIZE allocations, but the
@@ -306,25 +386,41 @@ net_devmem_bind_dmabuf(struct net_device *dev,
niov = &owner->area.niovs[i];
niov->type = NET_IOV_DMABUF;
niov->owner = &owner->area;
+ atomic_set(&niov->uref, 0);
page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
net_devmem_get_dma_addr(niov));
- if (direction == DMA_TO_DEVICE)
- binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+ binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
}
virtual += len;
}
+ mutex_lock(&devmem_ar_lock);
+
+ err = __net_devmem_dmabuf_binding_set_mode(direction, autorelease);
+ if (err < 0) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "System already configured with autorelease=%d",
+ static_key_enabled(&tcp_devmem_ar_key));
+ goto err_unlock_mutex;
+ }
+
err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
binding, xa_limit_32b, &id_alloc_next,
GFP_KERNEL);
if (err < 0)
- goto err_free_chunks;
+ goto err_unset_mode;
+
+ mutex_unlock(&devmem_ar_lock);
list_add(&binding->list, &priv->bindings);
return binding;
+err_unset_mode:
+ __net_devmem_dmabuf_binding_unset_mode(direction);
+err_unlock_mutex:
+ mutex_unlock(&devmem_ar_lock);
err_free_chunks:
gen_pool_for_each_chunk(binding->chunk_pool,
net_devmem_dmabuf_free_chunk_owner, NULL);
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 94874b323520..284f0ad5f381 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -12,9 +12,13 @@
#include <net/netmem.h>
#include <net/netdev_netlink.h>
+#include <linux/jump_label.h>
struct netlink_ext_ack;
+/* static key for TCP devmem autorelease */
+extern struct static_key_false tcp_devmem_ar_key;
+
struct net_devmem_dmabuf_binding {
struct dma_buf *dmabuf;
struct dma_buf_attachment *attachment;
@@ -43,6 +47,12 @@ struct net_devmem_dmabuf_binding {
*/
struct percpu_ref ref;
+ /* Counts sockets and rxqs that are using the binding. When this
+ * reaches zero, all urefs are drained and new sockets cannot join the
+ * binding.
+ */
+ atomic_t users;
+
/* The list of bindings currently active. Used for netlink to notify us
* of the user dropping the bind.
*/
@@ -61,7 +71,7 @@ struct net_devmem_dmabuf_binding {
/* Array of net_iov pointers for this binding, sorted by virtual
* address. This array is convenient to map the virtual addresses to
- * net_iovs in the TX path.
+ * net_iovs.
*/
struct net_iov **vec;
@@ -88,7 +98,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
struct device *dma_dev,
enum dma_data_direction direction,
unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
- struct netlink_ext_ack *extack);
+ struct netlink_ext_ack *extack, bool autorelease);
struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id);
void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
@@ -134,6 +144,26 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
percpu_ref_put(&binding->ref);
}
+void net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding);
+
+static inline bool
+net_devmem_dmabuf_binding_user_get(struct net_devmem_dmabuf_binding *binding)
+{
+ return atomic_inc_not_zero(&binding->users);
+}
+
+static inline void
+net_devmem_dmabuf_binding_user_put(struct net_devmem_dmabuf_binding *binding)
+{
+ if (atomic_dec_and_test(&binding->users))
+ net_devmem_dmabuf_binding_put_urefs(binding);
+}
+
+static inline bool net_devmem_autorelease_enabled(void)
+{
+ return static_branch_unlikely(&tcp_devmem_ar_key);
+}
+
void net_devmem_get_net_iov(struct net_iov *niov);
void net_devmem_put_net_iov(struct net_iov *niov);
@@ -151,11 +181,38 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr,
#else
struct net_devmem_dmabuf_binding;
+static inline bool
+net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
+{
+ return false;
+}
+
static inline void
net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
{
}
+static inline void
+net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
+{
+}
+
+static inline bool
+net_devmem_dmabuf_binding_user_get(struct net_devmem_dmabuf_binding *binding)
+{
+ return false;
+}
+
+static inline void
+net_devmem_dmabuf_binding_user_put(struct net_devmem_dmabuf_binding *binding)
+{
+}
+
+static inline bool net_devmem_autorelease_enabled(void)
+{
+ return false;
+}
+
static inline void net_devmem_get_net_iov(struct net_iov *niov)
{
}
@@ -170,7 +227,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
enum dma_data_direction direction,
unsigned int dmabuf_fd,
struct netdev_nl_sock *priv,
- struct netlink_ext_ack *extack)
+ struct netlink_ext_ack *extack,
+ bool autorelease)
{
return ERR_PTR(-EOPNOTSUPP);
}
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index ba673e81716f..01b7765e11ec 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -86,10 +86,11 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
};
/* NETDEV_CMD_BIND_RX - do */
-static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = {
+static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_AUTORELEASE + 1] = {
[NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
[NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
[NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy),
+ [NETDEV_A_DMABUF_AUTORELEASE] = { .type = NLA_U8, },
};
/* NETDEV_CMD_NAPI_SET - do */
@@ -188,7 +189,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
.cmd = NETDEV_CMD_BIND_RX,
.doit = netdev_nl_bind_rx_doit,
.policy = netdev_bind_rx_nl_policy,
- .maxattr = NETDEV_A_DMABUF_FD,
+ .maxattr = NETDEV_A_DMABUF_AUTORELEASE,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
{
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 470fabbeacd9..c742bb34865e 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -939,6 +939,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
struct netdev_nl_sock *priv;
struct net_device *netdev;
unsigned long *rxq_bitmap;
+ bool autorelease = false;
struct device *dma_dev;
struct sk_buff *rsp;
int err = 0;
@@ -952,6 +953,10 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]);
+ if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE])
+ autorelease =
+ !!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]);
+
priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
if (IS_ERR(priv))
return PTR_ERR(priv);
@@ -1002,7 +1007,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
}
binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE,
- dmabuf_fd, priv, info->extack);
+ dmabuf_fd, priv, info->extack,
+ autorelease);
if (IS_ERR(binding)) {
err = PTR_ERR(binding);
goto err_rxq_bitmap;
@@ -1097,7 +1103,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
dma_dev = netdev_queue_get_dma_dev(netdev, 0);
binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE,
- dmabuf_fd, priv, info->extack);
+ dmabuf_fd, priv, info->extack, false);
if (IS_ERR(binding)) {
err = PTR_ERR(binding);
goto err_unlock_netdev;
diff --git a/net/core/sock.c b/net/core/sock.c
index f6526f43aa6e..6355c2ccfb8a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -87,6 +87,7 @@
#include <linux/unaligned.h>
#include <linux/capability.h>
+#include <linux/dma-buf.h>
#include <linux/errno.h>
#include <linux/errqueue.h>
#include <linux/types.h>
@@ -151,6 +152,7 @@
#include <uapi/linux/pidfd.h>
#include "dev.h"
+#include "devmem.h"
static DEFINE_MUTEX(proto_list_mutex);
static LIST_HEAD(proto_list);
@@ -1081,6 +1083,44 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
#define MAX_DONTNEED_TOKENS 128
#define MAX_DONTNEED_FRAGS 1024
+static noinline_for_stack int
+sock_devmem_dontneed_manual_release(struct sock *sk,
+ struct dmabuf_token *tokens,
+ unsigned int num_tokens)
+{
+ struct net_iov *niov;
+ unsigned int i, j;
+ netmem_ref netmem;
+ unsigned int token;
+ int num_frags = 0;
+ int ret = 0;
+
+ if (!sk->sk_devmem_info.binding)
+ return -EINVAL;
+
+ for (i = 0; i < num_tokens; i++) {
+ for (j = 0; j < tokens[i].token_count; j++) {
+ size_t size = sk->sk_devmem_info.binding->dmabuf->size;
+
+ token = tokens[i].token_start + j;
+ if (token >= size / PAGE_SIZE)
+ break;
+
+ if (++num_frags > MAX_DONTNEED_FRAGS)
+ return ret;
+
+ niov = sk->sk_devmem_info.binding->vec[token];
+ if (atomic_dec_and_test(&niov->uref)) {
+ netmem = net_iov_to_netmem(niov);
+ WARN_ON_ONCE(!napi_pp_put_page(netmem));
+ }
+ ret++;
+ }
+ }
+
+ return ret;
+}
+
static noinline_for_stack int
sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
unsigned int num_tokens)
@@ -1089,32 +1129,33 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
int ret = 0, num_frags = 0;
netmem_ref netmems[16];
- xa_lock_bh(&sk->sk_user_frags);
+ xa_lock_bh(&sk->sk_devmem_info.frags);
for (i = 0; i < num_tokens; i++) {
for (j = 0; j < tokens[i].token_count; j++) {
if (++num_frags > MAX_DONTNEED_FRAGS)
goto frag_limit_reached;
netmem_ref netmem = (__force netmem_ref)__xa_erase(
- &sk->sk_user_frags, tokens[i].token_start + j);
+ &sk->sk_devmem_info.frags,
+ tokens[i].token_start + j);
if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
continue;
netmems[netmem_num++] = netmem;
if (netmem_num == ARRAY_SIZE(netmems)) {
- xa_unlock_bh(&sk->sk_user_frags);
+ xa_unlock_bh(&sk->sk_devmem_info.frags);
for (k = 0; k < netmem_num; k++)
WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
netmem_num = 0;
- xa_lock_bh(&sk->sk_user_frags);
+ xa_lock_bh(&sk->sk_devmem_info.frags);
}
ret++;
}
}
frag_limit_reached:
- xa_unlock_bh(&sk->sk_user_frags);
+ xa_unlock_bh(&sk->sk_devmem_info.frags);
for (k = 0; k < netmem_num; k++)
WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
@@ -1145,7 +1186,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
return -EFAULT;
}
- ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
+ if (net_devmem_autorelease_enabled())
+ ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
+ else
+ ret = sock_devmem_dontneed_manual_release(sk, tokens,
+ num_tokens);
kvfree(tokens);
return ret;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d5319ebe2452..73a577bd8765 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -260,6 +260,7 @@
#include <linux/memblock.h>
#include <linux/highmem.h>
#include <linux/cache.h>
+#include <linux/dma-buf.h>
#include <linux/err.h>
#include <linux/time.h>
#include <linux/slab.h>
@@ -492,7 +493,8 @@ void tcp_init_sock(struct sock *sk)
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
sk_sockets_allocated_inc(sk);
- xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1);
+ xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
+ sk->sk_devmem_info.binding = NULL;
}
EXPORT_IPV6_MOD(tcp_init_sock);
@@ -2424,11 +2426,12 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
/* Commit part that has been copied to user space. */
for (i = 0; i < p->idx; i++)
- __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
- (__force void *)p->netmems[i], GFP_KERNEL);
+ __xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i],
+ XA_ZERO_ENTRY, (__force void *)p->netmems[i],
+ GFP_KERNEL);
/* Rollback what has been pre-allocated and is no longer needed. */
for (; i < p->max; i++)
- __xa_erase(&sk->sk_user_frags, p->tokens[i]);
+ __xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]);
p->max = 0;
p->idx = 0;
@@ -2436,14 +2439,17 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p)
{
+ if (!net_devmem_autorelease_enabled())
+ return;
+
if (!p->max)
return;
- xa_lock_bh(&sk->sk_user_frags);
+ xa_lock_bh(&sk->sk_devmem_info.frags);
tcp_xa_pool_commit_locked(sk, p);
- xa_unlock_bh(&sk->sk_user_frags);
+ xa_unlock_bh(&sk->sk_devmem_info.frags);
}
static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
@@ -2454,24 +2460,41 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
if (p->idx < p->max)
return 0;
- xa_lock_bh(&sk->sk_user_frags);
+ xa_lock_bh(&sk->sk_devmem_info.frags);
tcp_xa_pool_commit_locked(sk, p);
for (k = 0; k < max_frags; k++) {
- err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
+ err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k],
XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
if (err)
break;
}
- xa_unlock_bh(&sk->sk_user_frags);
+ xa_unlock_bh(&sk->sk_devmem_info.frags);
p->max = k;
p->idx = 0;
return k ? 0 : err;
}
+static void tcp_xa_pool_inc_pp_ref_count(struct tcp_xa_pool *tcp_xa_pool,
+ skb_frag_t *frag)
+{
+ struct net_iov *niov;
+
+ niov = skb_frag_net_iov(frag);
+
+ if (net_devmem_autorelease_enabled()) {
+ atomic_long_inc(&niov->desc.pp_ref_count);
+ tcp_xa_pool->netmems[tcp_xa_pool->idx++] =
+ skb_frag_netmem(frag);
+ } else {
+ if (atomic_inc_return(&niov->uref) == 1)
+ atomic_long_inc(&niov->desc.pp_ref_count);
+ }
+}
+
/* On error, returns the -errno. On success, returns number of bytes sent to the
* user. May not consume all of @remaining_len.
*/
@@ -2533,6 +2556,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
* sequence of cmsg
*/
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ struct net_devmem_dmabuf_binding *binding = NULL;
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
struct net_iov *niov;
u64 frag_offset;
@@ -2568,13 +2592,45 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
start;
dmabuf_cmsg.frag_offset = frag_offset;
dmabuf_cmsg.frag_size = copy;
- err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
- skb_shinfo(skb)->nr_frags - i);
- if (err)
- goto out;
+
+ binding = net_devmem_iov_binding(niov);
+
+ if (net_devmem_autorelease_enabled()) {
+ err = tcp_xa_pool_refill(sk,
+ &tcp_xa_pool,
+ skb_shinfo(skb)->nr_frags - i);
+ if (err)
+ goto out;
+
+ dmabuf_cmsg.frag_token =
+ tcp_xa_pool.tokens[tcp_xa_pool.idx];
+ } else {
+ if (!sk->sk_devmem_info.binding) {
+ if (!net_devmem_dmabuf_binding_user_get(binding)) {
+ err = -ENODEV;
+ goto out;
+ }
+
+ if (!net_devmem_dmabuf_binding_get(binding)) {
+ net_devmem_dmabuf_binding_user_put(binding);
+ err = -ENODEV;
+ goto out;
+ }
+
+ sk->sk_devmem_info.binding = binding;
+ }
+
+ if (sk->sk_devmem_info.binding != binding) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ dmabuf_cmsg.frag_token =
+ net_iov_virtual_addr(niov) >> PAGE_SHIFT;
+ }
+
/* Will perform the exchange later */
- dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
offset += copy;
@@ -2587,8 +2643,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
if (err)
goto out;
- atomic_long_inc(&niov->desc.pp_ref_count);
- tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
+ tcp_xa_pool_inc_pp_ref_count(&tcp_xa_pool, frag);
sent += copy;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f8a9596e8f4d..420e8c8ebf6d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -89,6 +89,9 @@
#include <crypto/md5.h>
+#include <linux/dma-buf.h>
+#include "../core/devmem.h"
+
#include <trace/events/tcp.h>
#ifdef CONFIG_TCP_MD5SIG
@@ -2492,7 +2495,7 @@ static void tcp_release_user_frags(struct sock *sk)
unsigned long index;
void *netmem;
- xa_for_each(&sk->sk_user_frags, index, netmem)
+ xa_for_each(&sk->sk_devmem_info.frags, index, netmem)
WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
#endif
}
@@ -2503,7 +2506,15 @@ void tcp_v4_destroy_sock(struct sock *sk)
tcp_release_user_frags(sk);
- xa_destroy(&sk->sk_user_frags);
+ if (!net_devmem_autorelease_enabled() && sk->sk_devmem_info.binding) {
+ net_devmem_dmabuf_binding_user_put(sk->sk_devmem_info.binding);
+ net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding);
+ sk->sk_devmem_info.binding = NULL;
+ WARN_ONCE(!xa_empty(&sk->sk_devmem_info.frags),
+ "non-empty xarray discovered in autorelease off mode");
+ }
+
+ xa_destroy(&sk->sk_devmem_info.frags);
trace_tcp_destroy_sock(sk);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bd5462154f97..2aec977f5c12 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -662,7 +662,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
- xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
+ xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
+ newsk->sk_devmem_info.binding = NULL;
return newsk;
}
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index e0b579a1df4f..1e5c209cb998 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -207,6 +207,7 @@ enum {
NETDEV_A_DMABUF_QUEUES,
NETDEV_A_DMABUF_FD,
NETDEV_A_DMABUF_ID,
+ NETDEV_A_DMABUF_AUTORELEASE,
__NETDEV_A_DMABUF_MAX,
NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
--
2.47.3
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-16 5:02 [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
` (2 preceding siblings ...)
2026-01-16 5:02 ` [PATCH net-next v10 3/5] net: devmem: implement autorelease token management Bobby Eshleman
@ 2026-01-16 5:02 ` Bobby Eshleman
2026-01-21 0:36 ` Jakub Kicinski
2026-01-16 5:02 ` [PATCH net-next v10 5/5] selftests: drv-net: devmem: add autorelease tests Bobby Eshleman
2026-01-21 1:07 ` [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Jakub Kicinski
5 siblings, 1 reply; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-16 5:02 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter
Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman
From: Bobby Eshleman <bobbyeshleman@meta.com>
Update devmem.rst documentation to describe the autorelease netlink
attribute used during RX dmabuf binding.
The autorelease attribute is specified at bind-time via the netlink API
(NETDEV_CMD_BIND_RX) and controls what happens to outstanding tokens
when the socket closes.
Document the two token release modes (automatic vs manual), how to
configure the binding for autorelease, the perf benefits, new caveats
and restrictions, and the way the mode is enforced system-wide.
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v7:
- Document netlink instead of sockopt
- Mention system-wide locked to one mode
---
Documentation/networking/devmem.rst | 73 +++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
index a6cd7236bfbd..f85f1dcc9621 100644
--- a/Documentation/networking/devmem.rst
+++ b/Documentation/networking/devmem.rst
@@ -235,6 +235,79 @@ can be less than the tokens provided by the user in case of:
(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.
+
+Autorelease Control
+~~~~~~~~~~~~~~~~~~~
+
+The autorelease mode controls what happens to outstanding tokens (tokens not
+released via SO_DEVMEM_DONTNEED) when the socket closes. Autorelease is
+configured per-binding at binding creation time via the netlink API::
+
+ struct netdev_bind_rx_req *req;
+ struct netdev_bind_rx_rsp *rsp;
+ struct ynl_sock *ys;
+ struct ynl_error yerr;
+
+ ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+ req = netdev_bind_rx_req_alloc();
+ netdev_bind_rx_req_set_ifindex(req, ifindex);
+ netdev_bind_rx_req_set_fd(req, dmabuf_fd);
+ netdev_bind_rx_req_set_autorelease(req, 0); /* 0 = manual, 1 = auto */
+ __netdev_bind_rx_req_set_queues(req, queues, n_queues);
+
+ rsp = netdev_bind_rx(ys, req);
+
+ dmabuf_id = rsp->id;
+
+When autorelease is disabled (0):
+
+- Outstanding tokens are NOT released when the socket closes
+- Outstanding tokens are only released when all RX queues are unbound AND all
+ sockets that called recvmsg() are closed
+- Provides better performance by eliminating xarray overhead (~13% CPU reduction)
+- Kernel tracks tokens via atomic reference counters in net_iov structures
+
+When autorelease is enabled (1):
+
+- Outstanding tokens are automatically released when the socket closes
+- Backwards compatible behavior
+- Kernel tracks tokens in an xarray per socket
+
+The default is autorelease disabled.
+
+Important: In both modes, applications should call SO_DEVMEM_DONTNEED to
+return tokens as soon as they are done processing. The autorelease setting only
+affects what happens to tokens that are still outstanding when close() is called.
+
+The mode is enforced system-wide. Once a binding is created with a specific
+autorelease mode, all subsequent bindings system-wide must use the same mode.
+
+
+Performance Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Disabling autorelease provides approximately a 13% CPU utilization improvement
+in RX workloads. That said, applications must ensure all tokens are released
+via SO_DEVMEM_DONTNEED before closing the socket, otherwise the backing pages
+will remain pinned until all RX queues are unbound AND all sockets that called
+recvmsg() are closed.
+
+
+Caveats
+~~~~~~~
+
+- Once a system-wide autorelease mode is selected (via the first binding),
+ all subsequent bindings must use the same mode. Attempts to create bindings
+ with a different mode will be rejected with -EBUSY.
+
+- Applications using manual release mode (autorelease=0) must ensure all tokens
+ are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
+ leaks during the lifetime of the dmabuf binding. Tokens not released before
+ close() will only be freed when all RX queues are unbound AND all sockets
+ that called recvmsg() are closed.
+
+
TX Interface
============
--
2.47.3
* [PATCH net-next v10 5/5] selftests: drv-net: devmem: add autorelease tests
2026-01-16 5:02 [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
` (3 preceding siblings ...)
2026-01-16 5:02 ` [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute Bobby Eshleman
@ 2026-01-16 5:02 ` Bobby Eshleman
2026-01-21 1:07 ` [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Jakub Kicinski
5 siblings, 0 replies; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-16 5:02 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter
Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add test cases for autorelease and new edge cases.
A happy-path check_rx / check_rx_autorelease test is added.
The check_unbind_before_recv/_autorelease tests verify that unbind behaves
correctly when it happens after a connection is accepted but before recv is
called.
The check_unbind_after_recv/_autorelease tests verify that unbind behaves
correctly when it happens after a connection is accepted and recv has already
been called.
To facilitate the unbind tests, ncdevmem is changed to take an "unbind" mode.
The unbind modes are defined as the following:
UNBIND_MODE_NORMAL: unbind after normal traffic completes
UNBIND_MODE_BEFORE_RECV: Unbind before any recvmsg. The socket hasn't
become a user yet, so binding->users reaches zero and recvmsg should
fail with ENODEV. This validates that sockets can't access devmem after
the binding is torn down.
UNBIND_MODE_AFTER_RECV: Do one recvmsg first (socket becomes a user),
then unbind, then continue receiving. This validates that binding->users
keeps the binding alive for sockets that already acquired a reference
via recvmsg.
ncdevmem is also changed to take an autorelease flag for toggling the
autorelease mode.
TAP version 13
1..8
ok 1 devmem.check_rx
ok 2 devmem.check_rx_autorelease
ok 3 devmem.check_unbind_before_recv
ok 4 devmem.check_unbind_before_recv_autorelease
ok 5 devmem.check_unbind_after_recv
ok 6 devmem.check_unbind_after_recv_autorelease
ok 7 devmem.check_tx
ok 8 devmem.check_tx_chunks
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v10:
- add tests for "unbind before/after recv" edge cases
Changes in v8:
- removed stale/missing tests
Changes in v7:
- use autorelease netlink
- remove sockopt tests
---
tools/testing/selftests/drivers/net/hw/devmem.py | 98 ++++++++++++++++++++++-
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 68 ++++++++++++++--
2 files changed, 157 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/devmem.py b/tools/testing/selftests/drivers/net/hw/devmem.py
index 45c2d49d55b6..0bbfdf19e23d 100755
--- a/tools/testing/selftests/drivers/net/hw/devmem.py
+++ b/tools/testing/selftests/drivers/net/hw/devmem.py
@@ -25,7 +25,98 @@ def check_rx(cfg) -> None:
port = rand_port()
socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
- listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} -c {cfg.remote_addr} -v 7"
+ listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} \
+ -c {cfg.remote_addr} -v 7 -a 0"
+
+ with bkg(listen_cmd, exit_wait=True) as ncdevmem:
+ wait_port_listen(port)
+ cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
+ head -c 1K | {socat}", host=cfg.remote, shell=True)
+
+ ksft_eq(ncdevmem.ret, 0)
+
+
+@ksft_disruptive
+def check_rx_autorelease(cfg) -> None:
+ """Test devmem TCP receive with autorelease mode enabled."""
+ require_devmem(cfg)
+
+ port = rand_port()
+ socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
+ listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} \
+ -c {cfg.remote_addr} -v 7 -a 1"
+
+ with bkg(listen_cmd, exit_wait=True) as ncdevmem:
+ wait_port_listen(port)
+ cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
+ head -c 1K | {socat}", host=cfg.remote, shell=True)
+
+ ksft_eq(ncdevmem.ret, 0)
+
+
+@ksft_disruptive
+def check_unbind_before_recv(cfg) -> None:
+ """Test dmabuf unbind before socket recv with autorelease disabled."""
+ require_devmem(cfg)
+
+ port = rand_port()
+ socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
+ listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} \
+ -c {cfg.remote_addr} -v 7 -a 0 -U 1"
+
+ with bkg(listen_cmd, exit_wait=True) as ncdevmem:
+ wait_port_listen(port)
+ cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
+ head -c 1K | {socat}", host=cfg.remote, shell=True)
+
+ ksft_eq(ncdevmem.ret, 0)
+
+
+@ksft_disruptive
+def check_unbind_before_recv_autorelease(cfg) -> None:
+ """Test dmabuf unbind before socket recv with autorelease enabled."""
+ require_devmem(cfg)
+
+ port = rand_port()
+ socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
+ listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} \
+ -c {cfg.remote_addr} -v 7 -a 1 -U 1"
+
+ with bkg(listen_cmd, exit_wait=True) as ncdevmem:
+ wait_port_listen(port)
+ cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
+ head -c 1K | {socat}", host=cfg.remote, shell=True)
+
+ ksft_eq(ncdevmem.ret, 0)
+
+
+@ksft_disruptive
+def check_unbind_after_recv(cfg) -> None:
+ """Test dmabuf unbind after socket recv with autorelease disabled."""
+ require_devmem(cfg)
+
+ port = rand_port()
+ socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
+ listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} \
+ -c {cfg.remote_addr} -v 7 -a 0 -U 2"
+
+ with bkg(listen_cmd, exit_wait=True) as ncdevmem:
+ wait_port_listen(port)
+ cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
+ head -c 1K | {socat}", host=cfg.remote, shell=True)
+
+ ksft_eq(ncdevmem.ret, 0)
+
+
+@ksft_disruptive
+def check_unbind_after_recv_autorelease(cfg) -> None:
+ """Test dmabuf unbind after socket recv with autorelease enabled."""
+ require_devmem(cfg)
+
+ port = rand_port()
+ socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
+ listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} \
+ -c {cfg.remote_addr} -v 7 -a 1 -U 2"
with bkg(listen_cmd, exit_wait=True) as ncdevmem:
wait_port_listen(port)
@@ -68,7 +159,10 @@ def main() -> None:
cfg.bin_local = path.abspath(path.dirname(__file__) + "/ncdevmem")
cfg.bin_remote = cfg.remote.deploy(cfg.bin_local)
- ksft_run([check_rx, check_tx, check_tx_chunks],
+ ksft_run([check_rx, check_rx_autorelease,
+ check_unbind_before_recv, check_unbind_before_recv_autorelease,
+ check_unbind_after_recv, check_unbind_after_recv_autorelease,
+ check_tx, check_tx_chunks],
args=(cfg, ))
ksft_exit()
diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 3288ed04ce08..5cbff3c602b2 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -85,6 +85,13 @@
#define MAX_IOV 1024
+enum unbind_mode_type {
+ UNBIND_MODE_NORMAL,
+ UNBIND_MODE_BEFORE_RECV,
+ UNBIND_MODE_AFTER_RECV,
+ UNBIND_MODE_INVAL,
+};
+
static size_t max_chunk;
static char *server_ip;
static char *client_ip;
@@ -92,6 +99,8 @@ static char *port;
static size_t do_validation;
static int start_queue = -1;
static int num_queues = -1;
+static int devmem_autorelease;
+static enum unbind_mode_type unbind_mode;
static char *ifname;
static unsigned int ifindex;
static unsigned int dmabuf_id;
@@ -679,7 +688,8 @@ static int configure_flow_steering(struct sockaddr_in6 *server_sin)
static int bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd,
struct netdev_queue_id *queues,
- unsigned int n_queue_index, struct ynl_sock **ys)
+ unsigned int n_queue_index, struct ynl_sock **ys,
+ int autorelease)
{
struct netdev_bind_rx_req *req = NULL;
struct netdev_bind_rx_rsp *rsp = NULL;
@@ -695,6 +705,7 @@ static int bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd,
req = netdev_bind_rx_req_alloc();
netdev_bind_rx_req_set_ifindex(req, ifindex);
netdev_bind_rx_req_set_fd(req, dmabuf_fd);
+ netdev_bind_rx_req_set_autorelease(req, autorelease);
__netdev_bind_rx_req_set_queues(req, queues, n_queue_index);
rsp = netdev_bind_rx(*ys, req);
@@ -872,7 +883,8 @@ static int do_server(struct memory_buffer *mem)
goto err_reset_rss;
}
- if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys)) {
+ if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys,
+ devmem_autorelease)) {
pr_err("Failed to bind");
goto err_reset_flow_steering;
}
@@ -922,6 +934,23 @@ static int do_server(struct memory_buffer *mem)
fprintf(stderr, "Got connection from %s:%d\n", buffer,
ntohs(client_addr.sin6_port));
+ if (unbind_mode == UNBIND_MODE_BEFORE_RECV) {
+ struct pollfd pfd = {
+ .fd = client_fd,
+ .events = POLLIN,
+ };
+
+ /* Wait for data then unbind (before recvmsg) */
+ ret = poll(&pfd, 1, 5000);
+ if (ret <= 0) {
+ pr_err("poll failed or timed out waiting for data");
+ goto err_close_client;
+ }
+
+ ynl_sock_destroy(ys);
+ ys = NULL;
+ }
+
while (1) {
struct iovec iov = { .iov_base = iobuf,
.iov_len = sizeof(iobuf) };
@@ -942,11 +971,19 @@ static int do_server(struct memory_buffer *mem)
if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
continue;
if (ret < 0) {
+ if (unbind_mode == UNBIND_MODE_BEFORE_RECV &&
+ errno == ENODEV)
+ goto cleanup;
+
perror("recvmsg");
if (errno == EFAULT) {
pr_err("received EFAULT, won't recover");
goto err_close_client;
}
+ if (errno == ENODEV) {
+ pr_err("unexpected ENODEV");
+ goto err_close_client;
+ }
continue;
}
if (ret == 0) {
@@ -1025,6 +1062,11 @@ static int do_server(struct memory_buffer *mem)
goto err_close_client;
}
+ if (unbind_mode == UNBIND_MODE_AFTER_RECV && ys) {
+ ynl_sock_destroy(ys);
+ ys = NULL;
+ }
+
fprintf(stderr, "total_received=%lu\n", total_received);
}
@@ -1043,7 +1085,8 @@ static int do_server(struct memory_buffer *mem)
err_free_tmp:
free(tmp_mem);
err_unbind:
- ynl_sock_destroy(ys);
+ if (ys)
+ ynl_sock_destroy(ys);
err_reset_flow_steering:
reset_flow_steering();
err_reset_rss:
@@ -1092,7 +1135,7 @@ int run_devmem_tests(void)
goto err_reset_headersplit;
}
- if (!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys)) {
+ if (!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys, 0)) {
pr_err("Binding empty queues array should have failed");
goto err_unbind;
}
@@ -1108,7 +1151,7 @@ int run_devmem_tests(void)
goto err_reset_headersplit;
}
- if (!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys)) {
+ if (!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys, 0)) {
pr_err("Configure dmabuf with header split off should have failed");
goto err_unbind;
}
@@ -1124,7 +1167,7 @@ int run_devmem_tests(void)
goto err_reset_headersplit;
}
- if (bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys)) {
+ if (bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys, 0)) {
pr_err("Failed to bind");
goto err_reset_headersplit;
}
@@ -1397,7 +1440,7 @@ int main(int argc, char *argv[])
int is_server = 0, opt;
int ret, err = 1;
- while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:z:")) != -1) {
+ while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:z:a:U:")) != -1) {
switch (opt) {
case 'l':
is_server = 1;
@@ -1426,6 +1469,12 @@ int main(int argc, char *argv[])
case 'z':
max_chunk = atoi(optarg);
break;
+ case 'a':
+ devmem_autorelease = atoi(optarg);
+ break;
+ case 'U':
+ unbind_mode = atoi(optarg);
+ break;
case '?':
fprintf(stderr, "unknown option: %c\n", optopt);
break;
@@ -1437,6 +1486,11 @@ int main(int argc, char *argv[])
return 1;
}
+ if (unbind_mode >= UNBIND_MODE_INVAL) {
+ pr_err("invalid unbind mode %u\n", unbind_mode);
+ return 1;
+ }
+
ifindex = if_nametoindex(ifname);
fprintf(stderr, "using ifindex=%u\n", ifindex);
--
2.47.3
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-16 5:02 ` [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute Bobby Eshleman
@ 2026-01-21 0:36 ` Jakub Kicinski
2026-01-21 5:44 ` Bobby Eshleman
0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-21 0:36 UTC (permalink / raw)
To: Bobby Eshleman
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Thu, 15 Jan 2026 21:02:15 -0800 Bobby Eshleman wrote:
> +- Once a system-wide autorelease mode is selected (via the first binding),
> + all subsequent bindings must use the same mode. Attempts to create bindings
> + with a different mode will be rejected with -EBUSY.
Why?
> +- Applications using manual release mode (autorelease=0) must ensure all tokens
> + are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
> + leaks during the lifetime of the dmabuf binding. Tokens not released before
> + close() will only be freed when all RX queues are unbound AND all sockets
> + that called recvmsg() are closed.
Could you add a short example on how? by calling shutdown()?
* Re: [PATCH net-next v10 3/5] net: devmem: implement autorelease token management
2026-01-16 5:02 ` [PATCH net-next v10 3/5] net: devmem: implement autorelease token management Bobby Eshleman
@ 2026-01-21 1:00 ` Jakub Kicinski
2026-01-21 5:33 ` Bobby Eshleman
2026-01-22 4:15 ` Mina Almasry
1 sibling, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-21 1:00 UTC (permalink / raw)
To: Bobby Eshleman
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Thu, 15 Jan 2026 21:02:14 -0800 Bobby Eshleman wrote:
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 596c306ce52b..a5301b150663 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -562,6 +562,17 @@ attribute-sets:
> type: u32
> checks:
> min: 1
> + -
> + name: autorelease
> + doc: |
> + Token autorelease mode. If true (1), leaked tokens are automatically
> + released when the socket closes. If false (0), leaked tokens are only
> + released when the dmabuf is torn down. Once a binding is created with
> + a specific mode, all subsequent bindings system-wide must use the
> + same mode.
> +
> + Optional. Defaults to false if not specified.
> + type: u8
if you plan to have more values - u32, if not - flag
u8 is 8b value + 24b of padding, it's only useful for proto fields
> operations:
> list:
> @@ -769,6 +780,7 @@ operations:
> - ifindex
> - fd
> - queues
> + - autorelease
> reply:
> attributes:
> - id
> static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
> +static DEFINE_MUTEX(devmem_ar_lock);
> +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> +EXPORT_SYMBOL(tcp_devmem_ar_key);
I don't think you need the export, perhaps move the helper in here in
the first place (while keeping the static inline wrapper when devmem=n)?
> + if (autorelease)
> + static_branch_enable(&tcp_devmem_ar_key);
This is user-controlled (non-root), right? So I think we need
the deferred version of key helpers.
> - if (direction == DMA_TO_DEVICE) {
> - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> - sizeof(struct net_iov *),
> - GFP_KERNEL);
> - if (!binding->vec) {
> - err = -ENOMEM;
> - goto err_unmap;
> - }
> + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> + sizeof(struct net_iov *),
> + GFP_KERNEL | __GFP_ZERO);
make it a kvcalloc() while we're touching it, pls
> + if (!binding->vec) {
> + err = -ENOMEM;
> + goto err_unmap;
> }
>
> /* For simplicity we expect to make PAGE_SIZE allocations, but the
> @@ -306,25 +386,41 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> niov = &owner->area.niovs[i];
> niov->type = NET_IOV_DMABUF;
> niov->owner = &owner->area;
> + atomic_set(&niov->uref, 0);
Isn't it zero'ed during alloc?
> page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
> net_devmem_get_dma_addr(niov));
> - if (direction == DMA_TO_DEVICE)
> - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> }
>
> virtual += len;
> }
>
> + if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE])
> + autorelease =
> + !!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]);
nla_get_u8_default()
> priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
> if (IS_ERR(priv))
> return PTR_ERR(priv);
> +static noinline_for_stack int
> +sock_devmem_dontneed_manual_release(struct sock *sk,
> + struct dmabuf_token *tokens,
> + unsigned int num_tokens)
> +{
> + struct net_iov *niov;
> + unsigned int i, j;
> + netmem_ref netmem;
> + unsigned int token;
> + int num_frags = 0;
> + int ret = 0;
> +
> + if (!sk->sk_devmem_info.binding)
> + return -EINVAL;
> +
> + for (i = 0; i < num_tokens; i++) {
> + for (j = 0; j < tokens[i].token_count; j++) {
> + size_t size = sk->sk_devmem_info.binding->dmabuf->size;
> +
> + token = tokens[i].token_start + j;
> + if (token >= size / PAGE_SIZE)
> + break;
> +
> + if (++num_frags > MAX_DONTNEED_FRAGS)
> + return ret;
> +
> + niov = sk->sk_devmem_info.binding->vec[token];
> + if (atomic_dec_and_test(&niov->uref)) {
Don't you need something like "atomic dec non zero and test" ?
refcount has refcount_dec_not_one() 🤔️
> + netmem = net_iov_to_netmem(niov);
> + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> + }
> + ret++;
> + }
> frag_limit_reached:
> - xa_unlock_bh(&sk->sk_user_frags);
> + xa_unlock_bh(&sk->sk_devmem_info.frags);
may be worth separating the sk_devmem_info change out for clarity
> for (k = 0; k < netmem_num; k++)
> WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> @@ -2503,7 +2506,15 @@ void tcp_v4_destroy_sock(struct sock *sk)
>
> tcp_release_user_frags(sk);
>
> - xa_destroy(&sk->sk_user_frags);
> + if (!net_devmem_autorelease_enabled() && sk->sk_devmem_info.binding) {
> + net_devmem_dmabuf_binding_user_put(sk->sk_devmem_info.binding);
> + net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding);
> + sk->sk_devmem_info.binding = NULL;
> + WARN_ONCE(!xa_empty(&sk->sk_devmem_info.frags),
> + "non-empty xarray discovered in autorelease off mode");
> + }
> +
> + xa_destroy(&sk->sk_devmem_info.frags);
Let's wrap this up in a helper that'll live in devmem.c
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-16 5:02 [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
` (4 preceding siblings ...)
2026-01-16 5:02 ` [PATCH net-next v10 5/5] selftests: drv-net: devmem: add autorelease tests Bobby Eshleman
@ 2026-01-21 1:07 ` Jakub Kicinski
2026-01-21 5:29 ` Bobby Eshleman
2026-01-22 4:21 ` Mina Almasry
5 siblings, 2 replies; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-21 1:07 UTC (permalink / raw)
To: Bobby Eshleman
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Thu, 15 Jan 2026 21:02:11 -0800 Bobby Eshleman wrote:
> This series improves the CPU cost of RX token management by adding an
> attribute to NETDEV_CMD_BIND_RX that configures sockets using the
> binding to avoid the xarray allocator and instead use a per-binding niov
> array and a uref field in niov.
>
> Improvement is ~13% cpu util per RX user thread.
>
> Using kperf, the following results were observed:
>
> Before:
> Average RX worker idle %: 13.13, flows 4, test runs 11
> After:
> Average RX worker idle %: 26.32, flows 4, test runs 11
>
> Two other approaches were tested, but with no improvement. Namely, 1)
> using a hashmap for tokens and 2) keeping an xarray of atomic counters
> but using RCU so that the hotpath could be mostly lockless. Neither of
> these approaches proved better than the simple array in terms of CPU.
>
> The attribute NETDEV_A_DMABUF_AUTORELEASE is added to toggle the
> optimization. It is an optional attribute and defaults to 0 (i.e.,
> optimization on).
IDK if the cmsg approach is still right for this flow TBH.
IIRC when Stan talked about this a while back we were considering doing
this via Netlink. Anything that proves that the user owns the binding
would work. IIUC the TCP socket in this design just proves that socket
has received a token from a given binding right?
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-21 1:07 ` [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Jakub Kicinski
@ 2026-01-21 5:29 ` Bobby Eshleman
2026-01-22 1:37 ` Jakub Kicinski
2026-01-22 4:21 ` Mina Almasry
1 sibling, 1 reply; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-21 5:29 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Tue, Jan 20, 2026 at 05:07:49PM -0800, Jakub Kicinski wrote:
> On Thu, 15 Jan 2026 21:02:11 -0800 Bobby Eshleman wrote:
> > This series improves the CPU cost of RX token management by adding an
> > attribute to NETDEV_CMD_BIND_RX that configures sockets using the
> > binding to avoid the xarray allocator and instead use a per-binding niov
> > array and a uref field in niov.
> >
> > Improvement is ~13% cpu util per RX user thread.
> >
> > Using kperf, the following results were observed:
> >
> > Before:
> > Average RX worker idle %: 13.13, flows 4, test runs 11
> > After:
> > Average RX worker idle %: 26.32, flows 4, test runs 11
> >
> > Two other approaches were tested, but with no improvement. Namely, 1)
> > using a hashmap for tokens and 2) keeping an xarray of atomic counters
> > but using RCU so that the hotpath could be mostly lockless. Neither of
> > these approaches proved better than the simple array in terms of CPU.
> >
> > The attribute NETDEV_A_DMABUF_AUTORELEASE is added to toggle the
> > optimization. It is an optional attribute and defaults to 0 (i.e.,
> > optimization on).
>
> IDK if the cmsg approach is still right for this flow TBH.
> IIRC when Stan talked about this a while back we were considering doing
> this via Netlink. Anything that proves that the user owns the binding
> would work. IIUC the TCP socket in this design just proves that socket
> has received a token from a given binding right?
In both designs the owner of the binding starts off as the netlink opener,
and then ownership spreads out to TCP sockets as packets are steered to
them. Tokens are received by the user, which gives them a share in the
form of references on the pp and binding. This design follows the same
approach... but I may be misinterpreting what you mean by ownership?
Best,
Bobby
* Re: [PATCH net-next v10 3/5] net: devmem: implement autorelease token management
2026-01-21 1:00 ` Jakub Kicinski
@ 2026-01-21 5:33 ` Bobby Eshleman
0 siblings, 0 replies; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-21 5:33 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Tue, Jan 20, 2026 at 05:00:42PM -0800, Jakub Kicinski wrote:
> On Thu, 15 Jan 2026 21:02:14 -0800 Bobby Eshleman wrote:
> > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> > index 596c306ce52b..a5301b150663 100644
> > --- a/Documentation/netlink/specs/netdev.yaml
> > +++ b/Documentation/netlink/specs/netdev.yaml
> > @@ -562,6 +562,17 @@ attribute-sets:
> > type: u32
> > checks:
> > min: 1
> > + -
> > + name: autorelease
> > + doc: |
> > + Token autorelease mode. If true (1), leaked tokens are automatically
> > + released when the socket closes. If false (0), leaked tokens are only
> > + released when the dmabuf is torn down. Once a binding is created with
> > + a specific mode, all subsequent bindings system-wide must use the
> > + same mode.
> > +
> > + Optional. Defaults to false if not specified.
> > + type: u8
>
> if you plan to have more values - u32, if not - flag
> u8 is 8b value + 24b of padding, it's only useful for proto fields
>
> > operations:
> > list:
> > @@ -769,6 +780,7 @@ operations:
> > - ifindex
> > - fd
> > - queues
> > + - autorelease
> > reply:
> > attributes:
> > - id
>
> > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
> > +static DEFINE_MUTEX(devmem_ar_lock);
> > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> > +EXPORT_SYMBOL(tcp_devmem_ar_key);
>
> I don't think you need the export, perhaps move the helper in here in
> the first place (while keeping the static inline wrapper when devmem=n)?
>
> > + if (autorelease)
> > + static_branch_enable(&tcp_devmem_ar_key);
>
> This is user-controlled (non-root), right? So I think we need
> the deferred version of key helpers.
>
> > - if (direction == DMA_TO_DEVICE) {
> > - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> > - sizeof(struct net_iov *),
> > - GFP_KERNEL);
> > - if (!binding->vec) {
> > - err = -ENOMEM;
> > - goto err_unmap;
> > - }
> > + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> > + sizeof(struct net_iov *),
> > + GFP_KERNEL | __GFP_ZERO);
>
> make it a kvcalloc() while we're touching it, pls
>
> > + if (!binding->vec) {
> > + err = -ENOMEM;
> > + goto err_unmap;
> > }
> >
> > /* For simplicity we expect to make PAGE_SIZE allocations, but the
> > @@ -306,25 +386,41 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > niov = &owner->area.niovs[i];
> > niov->type = NET_IOV_DMABUF;
> > niov->owner = &owner->area;
> > + atomic_set(&niov->uref, 0);
>
> Isn't it zero'ed during alloc?
>
> > page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
> > net_devmem_get_dma_addr(niov));
> > - if (direction == DMA_TO_DEVICE)
> > - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> > + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> > }
> >
> > virtual += len;
> > }
> >
>
> > + if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE])
> > + autorelease =
> > + !!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]);
>
> nla_get_u8_default()
>
> > priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
> > if (IS_ERR(priv))
> > return PTR_ERR(priv);
>
> > +static noinline_for_stack int
> > +sock_devmem_dontneed_manual_release(struct sock *sk,
> > + struct dmabuf_token *tokens,
> > + unsigned int num_tokens)
> > +{
> > + struct net_iov *niov;
> > + unsigned int i, j;
> > + netmem_ref netmem;
> > + unsigned int token;
> > + int num_frags = 0;
> > + int ret = 0;
> > +
> > + if (!sk->sk_devmem_info.binding)
> > + return -EINVAL;
> > +
> > + for (i = 0; i < num_tokens; i++) {
> > + for (j = 0; j < tokens[i].token_count; j++) {
> > + size_t size = sk->sk_devmem_info.binding->dmabuf->size;
> > +
> > + token = tokens[i].token_start + j;
> > + if (token >= size / PAGE_SIZE)
> > + break;
> > +
> > + if (++num_frags > MAX_DONTNEED_FRAGS)
> > + return ret;
> > +
> > + niov = sk->sk_devmem_info.binding->vec[token];
> > + if (atomic_dec_and_test(&niov->uref)) {
>
> Don't you need something like "atomic dec non zero and test" ?
> refcount has refcount_dec_not_one() 🤔️
>
Good point, that would be better for sure.
> > + netmem = net_iov_to_netmem(niov);
> > + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> > + }
> > + ret++;
> > + }
>
> > frag_limit_reached:
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
>
> may be worth separating the sk_devmem_info change out for clarity
>
> > for (k = 0; k < netmem_num; k++)
> > WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
>
> > @@ -2503,7 +2506,15 @@ void tcp_v4_destroy_sock(struct sock *sk)
> >
> > tcp_release_user_frags(sk);
> >
> > - xa_destroy(&sk->sk_user_frags);
> > + if (!net_devmem_autorelease_enabled() && sk->sk_devmem_info.binding) {
> > + net_devmem_dmabuf_binding_user_put(sk->sk_devmem_info.binding);
> > + net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding);
> > + sk->sk_devmem_info.binding = NULL;
> > + WARN_ONCE(!xa_empty(&sk->sk_devmem_info.frags),
> > + "non-empty xarray discovered in autorelease off mode");
> > + }
> > +
> > + xa_destroy(&sk->sk_devmem_info.frags);
>
> Let's wrap this up in a helper that'll live in devmem.c
All of the above SGTM!
Thanks,
Bobby
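The hazard behind the "atomic dec non zero and test" suggestion above can be shown with a toy userspace model. This is an illustrative sketch, not the kernel code: plain ints stand in for atomic_t, and dec_if_positive() mimics the semantics of the kernel's atomic_dec_if_positive() / refcount_dec_not_one() family.

```c
#include <assert.h>

/* Toy model of the niov user refcount (uref). dec_and_test() mirrors
 * atomic_dec_and_test(): it decrements unconditionally, so a duplicate
 * (or forged) token return drives the count negative. dec_if_positive()
 * refuses to go below zero, which is the guarded behavior being
 * suggested for untrusted SO_DEVMEM_DONTNEED input. No concurrency. */

static int dec_and_test(int *uref)
{
	return --(*uref) == 0;	/* nonzero means "last ref, free page" */
}

static int dec_if_positive(int *uref)
{
	if (*uref <= 0)
		return 0;	/* bogus/duplicate token: ignore it */
	return --(*uref) == 0;
}
```

With the unconditional decrement, a duplicate return leaves the count at -1 and the accounting is silently corrupted; the guarded version simply ignores the duplicate.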
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-21 0:36 ` Jakub Kicinski
@ 2026-01-21 5:44 ` Bobby Eshleman
2026-01-22 1:35 ` Jakub Kicinski
0 siblings, 1 reply; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-21 5:44 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Tue, Jan 20, 2026 at 04:36:50PM -0800, Jakub Kicinski wrote:
> On Thu, 15 Jan 2026 21:02:15 -0800 Bobby Eshleman wrote:
> > +- Once a system-wide autorelease mode is selected (via the first binding),
> > + all subsequent bindings must use the same mode. Attempts to create bindings
> > + with a different mode will be rejected with -EBUSY.
>
> Why?
>
Originally I was using EINVAL, but when writing the tests I noticed
EINVAL might be confusing for users to interpret in this case (i.e., some
binding possibly made by someone else is in a different mode). I thought
EBUSY could capture the semantic "the system is locked into a different
mode, try again when it isn't".
I'm not married to it though. Happy to go back to EINVAL or another
errno.
> > +- Applications using manual release mode (autorelease=0) must ensure all tokens
> > + are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
> > + leaks during the lifetime of the dmabuf binding. Tokens not released before
> > + close() will only be freed when all RX queues are unbound AND all sockets
> > + that called recvmsg() are closed.
>
> Could you add a short example on how? by calling shutdown()?
Show an example of the three steps: returning the tokens, unbinding, and closing the
sockets (TCP/NL)?
Best,
Bobby
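The token-return step of the example discussed above can be made concrete with a hedged userspace sketch. The coalesce_tokens() helper and its values are illustrative (not from the patch); struct dmabuf_token mirrors the uAPI layout an application would hand to setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, ...).

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as struct dmabuf_token in <linux/uio.h>. */
struct dmabuf_token {
	unsigned int token_start;
	unsigned int token_count;
};

/* Illustrative helper: collapse a sorted list of received token ids
 * into (start, count) ranges before returning them to the kernel via
 * SO_DEVMEM_DONTNEED. Returns the number of ranges written to out[]. */
static size_t coalesce_tokens(const unsigned int *ids, size_t n,
			      struct dmabuf_token *out)
{
	size_t nr = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (nr && ids[i] == out[nr - 1].token_start +
			  out[nr - 1].token_count) {
			out[nr - 1].token_count++;	/* extend the run */
		} else {
			out[nr].token_start = ids[i];
			out[nr].token_count = 1;
			nr++;
		}
	}
	return nr;
}
```

After returning the ranges, the remaining steps are unbinding the RX queues (closing the netlink socket that holds the binding) and closing the TCP sockets that received tokens.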
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-21 5:44 ` Bobby Eshleman
@ 2026-01-22 1:35 ` Jakub Kicinski
2026-01-22 2:37 ` Bobby Eshleman
0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-22 1:35 UTC (permalink / raw)
To: Bobby Eshleman
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Tue, 20 Jan 2026 21:44:09 -0800 Bobby Eshleman wrote:
> On Tue, Jan 20, 2026 at 04:36:50PM -0800, Jakub Kicinski wrote:
> > On Thu, 15 Jan 2026 21:02:15 -0800 Bobby Eshleman wrote:
> > > +- Once a system-wide autorelease mode is selected (via the first binding),
> > > + all subsequent bindings must use the same mode. Attempts to create bindings
> > > + with a different mode will be rejected with -EBUSY.
> >
> > Why?
>
> Originally I was using EINVAL, but when writing the tests I noticed
> EINVAL might be confusing for users to interpret in this case (i.e., some
> binding possibly made by someone else is in a different mode). I thought
> EBUSY could capture the semantic "the system is locked into a different
> mode, try again when it isn't".
>
> I'm not married to it though. Happy to go back to EINVAL or another
> errno.
My question was more why the system-wide policy exists, rather than
binding-by-binding. Naively I'd think that a single socket must pick,
but system-wide there could easily be multiple bindings not bothering
each other, doing different things?
> > > +- Applications using manual release mode (autorelease=0) must ensure all tokens
> > > + are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
> > > + leaks during the lifetime of the dmabuf binding. Tokens not released before
> > > + close() will only be freed when all RX queues are unbound AND all sockets
> > > + that called recvmsg() are closed.
> >
> > Could you add a short example on how? by calling shutdown()?
>
> Show an example of the three steps: returning the tokens, unbinding, and closing the
> sockets (TCP/NL)?
TBH I read the doc before reading the code, which I guess may actually
be better since we don't expect users to read the code first either..
Now after reading the code I'm not sure the doc explains things
properly. AFAIU there's no association of token <> socket within the
same binding. User can close socket A and return the tokens via socket
B. As written the doc made me think that there will be a leak if socket
is closed without releasing tokens, or that there may be a race with
data queued but not read. Neither is true, really?
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-21 5:29 ` Bobby Eshleman
@ 2026-01-22 1:37 ` Jakub Kicinski
0 siblings, 0 replies; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-22 1:37 UTC (permalink / raw)
To: Bobby Eshleman
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Tue, 20 Jan 2026 21:29:36 -0800 Bobby Eshleman wrote:
> > IDK if the cmsg approach is still right for this flow TBH.
> > IIRC when Stan talked about this a while back we were considering doing
> > this via Netlink. Anything that proves that the user owns the binding
> > would work. IIUC the TCP socket in this design just proves that socket
> > has received a token from a given binding right?
>
> In both designs the owner of the binding starts off as the netlink opener,
> and then ownership spreads out to TCP sockets as packets are steered to
> them. Tokens are received by the user, which gives them a share in the
> form of references on the pp and binding. This design follows the same
> approach... but I may be misinterpreting what you mean by ownership?
What I was getting at was the same point about socket A vs socket B as
I made on the doc patch. IOW the kernel only tracks how many tokens it
gave out for a net_iov, there's no socket state beyond the binding
pointer. Right?
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-22 1:35 ` Jakub Kicinski
@ 2026-01-22 2:37 ` Bobby Eshleman
2026-01-22 2:50 ` Jakub Kicinski
0 siblings, 1 reply; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-22 2:37 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Wed, Jan 21, 2026 at 05:35:12PM -0800, Jakub Kicinski wrote:
> On Tue, 20 Jan 2026 21:44:09 -0800 Bobby Eshleman wrote:
> > On Tue, Jan 20, 2026 at 04:36:50PM -0800, Jakub Kicinski wrote:
> > > On Thu, 15 Jan 2026 21:02:15 -0800 Bobby Eshleman wrote:
> > > > +- Once a system-wide autorelease mode is selected (via the first binding),
> > > > + all subsequent bindings must use the same mode. Attempts to create bindings
> > > > + with a different mode will be rejected with -EBUSY.
> > >
> > > Why?
> >
> > Originally I was using EINVAL, but when writing the tests I noticed
> > EINVAL might be confusing for users to interpret in this case (i.e., some
> > binding possibly made by someone else is in a different mode). I thought
> > EBUSY could capture the semantic "the system is locked into a different
> > mode, try again when it isn't".
> >
> > I'm not married to it though. Happy to go back to EINVAL or another
> > errno.
>
> My question was more why the system-wide policy exists, rather than
> binding-by-binding. Naively I'd think that a single socket must pick,
> but system-wide there could easily be multiple bindings not bothering
> each other, doing different things?
Originally we allowed per-binding policy, but it seemed one-per-system
might 1) simplify reasoning about the code by only allowing one policy
per system, and 2) allow simpler deprecation of autorelease=on if it's
found to be obsolete over time (just hack off that particular path of
the static branch set). It doesn't prevent any races/bugs or anything.
>
> > > > +- Applications using manual release mode (autorelease=0) must ensure all tokens
> > > > + are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
> > > > + leaks during the lifetime of the dmabuf binding. Tokens not released before
> > > > + close() will only be freed when all RX queues are unbound AND all sockets
> > > > + that called recvmsg() are closed.
> > >
> > > Could you add a short example on how? by calling shutdown()?
> >
> > Show an example of the three steps: returning the tokens, unbinding, and closing the
> > sockets (TCP/NL)?
>
> TBH I read the doc before reading the code, which I guess may actually
> be better since we don't expect users to read the code first either..
>
> Now after reading the code I'm not sure the doc explains things
> properly. AFAIU there's no association of token <> socket within the
> same binding. User can close socket A and return the tokens via socket
> B. As written the doc made me think that there will be a leak if socket
> is closed without releasing tokens, or that there may be a race with
> data queued but not read. Neither is true, really?
That is correct, neither is true. If the two sockets share a binding the
kernel doesn't care which socket received the token or which one
returned it. No token <> socket association. There is no
queued-but-not-read race either. If any tokens are not returned, as long
as all of the binding references are eventually released and all sockets
that used the binding are closed, then all references will be accounted
for and everything cleaned up.
Best,
Bobby
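The "no token <> socket association" point above can be illustrated with a toy model. The structure and helper names here are hypothetical, chosen only to show the accounting shape: the binding carries per-page outstanding-token counts plus a coarse user count, so which socket returns a token is irrelevant.

```c
#include <assert.h>

#define NPAGES 4

/* Toy model of manual-release accounting for one binding shared by
 * several sockets. Only the binding carries state: per-page counts of
 * outstanding tokens (uref) plus how many users (bound RX queues and
 * sockets that received tokens) still hold the binding. */
struct binding {
	int uref[NPAGES];
	int users;
};

static void recv_token(struct binding *b, unsigned int page)
{
	b->uref[page]++;	/* token was handed to *some* socket */
}

static void return_token(struct binding *b, unsigned int page)
{
	if (b->uref[page] > 0)	/* any socket may return it */
		b->uref[page]--;
}

static void drop_user(struct binding *b)
{
	b->users--;		/* a queue was unbound or a socket closed */
}

static int binding_freeable(const struct binding *b)
{
	return b->users == 0;	/* leaked urefs are reclaimed only here */
}
```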
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-22 2:37 ` Bobby Eshleman
@ 2026-01-22 2:50 ` Jakub Kicinski
2026-01-22 3:25 ` Bobby Eshleman
0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-22 2:50 UTC (permalink / raw)
To: Bobby Eshleman
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Wed, 21 Jan 2026 18:37:56 -0800 Bobby Eshleman wrote:
> > > Show an example of the three steps: returning the tokens, unbinding, and closing the
> > > sockets (TCP/NL)?
> >
> > TBH I read the doc before reading the code, which I guess may actually
> > be better since we don't expect users to read the code first either..
> >
> > Now after reading the code I'm not sure the doc explains things
> > properly. AFAIU there's no association of token <> socket within the
> > same binding. User can close socket A and return the tokens via socket
> > B. As written the doc made me think that there will be a leak if socket
> > is closed without releasing tokens, or that there may be a race with
> > data queued but not read. Neither is true, really?
>
> That is correct, neither is true. If the two sockets share a binding the
> kernel doesn't care which socket received the token or which one
> returned it. No token <> socket association. There is no
> queued-but-not-read race either. If any tokens are not returned, as long
> as all of the binding references are eventually released and all sockets
> that used the binding are closed, then all references will be accounted
> for and everything cleaned up.
Naming is hard, but I wonder whether the whole feature wouldn't be
better referred to as something to do with global token accounting
/ management? AUTORELEASE makes sense but seems like focusing on one
particular side effect.
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-22 2:50 ` Jakub Kicinski
@ 2026-01-22 3:25 ` Bobby Eshleman
2026-01-22 3:46 ` Jakub Kicinski
0 siblings, 1 reply; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-22 3:25 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Wed, Jan 21, 2026 at 6:50 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 21 Jan 2026 18:37:56 -0800 Bobby Eshleman wrote:
> > > > Show an example of the three steps: returning the tokens, unbinding, and closing the
> > > > sockets (TCP/NL)?
> > >
> > > TBH I read the doc before reading the code, which I guess may actually
> > > be better since we don't expect users to read the code first either..
> > >
> > > Now after reading the code I'm not sure the doc explains things
> > > properly. AFAIU there's no association of token <> socket within the
> > > same binding. User can close socket A and return the tokens via socket
> > > B. As written the doc made me think that there will be a leak if socket
> > > is closed without releasing tokens, or that there may be a race with
> > > data queued but not read. Neither is true, really?
> >
> > That is correct, neither is true. If the two sockets share a binding the
> > kernel doesn't care which socket received the token or which one
> > returned it. No token <> socket association. There is no
> > queued-but-not-read race either. If any tokens are not returned, as long
> > as all of the binding references are eventually released and all sockets
> > that used the binding are closed, then all references will be accounted
> > for and everything cleaned up.
>
> Naming is hard, but I wonder whether the whole feature wouldn't be
> better referred to as something to do with global token accounting
> / management? AUTORELEASE makes sense but seems like focusing on one
> particular side effect.
Good point. The only real use case for autorelease=on is for backwards
compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-22 3:25 ` Bobby Eshleman
@ 2026-01-22 3:46 ` Jakub Kicinski
2026-01-22 4:07 ` Stanislav Fomichev
0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-22 3:46 UTC (permalink / raw)
To: Bobby Eshleman
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern,
Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Wed, 21 Jan 2026 19:25:27 -0800 Bobby Eshleman wrote:
> > > That is correct, neither is true. If the two sockets share a binding the
> > > kernel doesn't care which socket received the token or which one
> > > returned it. No token <> socket association. There is no
> > > queued-but-not-read race either. If any tokens are not returned, as long
> > > as all of the binding references are eventually released and all sockets
> > > that used the binding are closed, then all references will be accounted
> > > for and everything cleaned up.
> >
> > Naming is hard, but I wonder whether the whole feature wouldn't be
> > better referred to as something to do with global token accounting
> > / management? AUTORELEASE makes sense but seems like focusing on one
> > particular side effect.
>
> Good point. The only real use case for autorelease=on is for backwards
> compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
> or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?
Hm. Maybe let's return to naming once we have consensus on the uAPI.
Does everyone think that pushing this via TCP socket opts still makes
sense, even tho in practice the TCP socket is just how we find the
binding?
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-22 3:46 ` Jakub Kicinski
@ 2026-01-22 4:07 ` Stanislav Fomichev
2026-01-27 1:26 ` Jakub Kicinski
0 siblings, 1 reply; 34+ messages in thread
From: Stanislav Fomichev @ 2026-01-22 4:07 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Bobby Eshleman, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On 01/21, Jakub Kicinski wrote:
> On Wed, 21 Jan 2026 19:25:27 -0800 Bobby Eshleman wrote:
> > > > That is correct, neither is true. If the two sockets share a binding the
> > > > kernel doesn't care which socket received the token or which one
> > > > returned it. No token <> socket association. There is no
> > > > queued-but-not-read race either. If any tokens are not returned, as long
> > > > as all of the binding references are eventually released and all sockets
> > > > that used the binding are closed, then all references will be accounted
> > > > for and everything cleaned up.
> > >
> > > Naming is hard, but I wonder whether the whole feature wouldn't be
> > > better referred to as something to do with global token accounting
> > > / management? AUTORELEASE makes sense but seems like focusing on one
> > > particular side effect.
> >
> > Good point. The only real use case for autorelease=on is for backwards
> > compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
> > or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?
>
> Hm. Maybe let's return to naming once we have consensus on the uAPI.
>
> Does everyone think that pushing this via TCP socket opts still makes
> sense, even tho in practice the TCP socket is just how we find the
> binding?
I'm not a fan of the existing cmsg scheme, but we already have userspace
using it, so getting more performance out of it seems like an easy win?
* Re: [PATCH net-next v10 3/5] net: devmem: implement autorelease token management
2026-01-16 5:02 ` [PATCH net-next v10 3/5] net: devmem: implement autorelease token management Bobby Eshleman
2026-01-21 1:00 ` Jakub Kicinski
@ 2026-01-22 4:15 ` Mina Almasry
2026-01-22 5:18 ` Bobby Eshleman
1 sibling, 1 reply; 34+ messages in thread
From: Mina Almasry @ 2026-01-22 4:15 UTC (permalink / raw)
To: Bobby Eshleman
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Thu, Jan 15, 2026 at 9:03 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
>
> From: Bobby Eshleman <bobbyeshleman@meta.com>
>
> Add support for autorelease toggling of tokens using a static branch to
> control system-wide behavior. This allows applications to choose between
> two memory management modes:
>
> 1. Autorelease on: Leaked tokens are automatically released when the
> socket closes.
>
> 2. Autorelease off: Leaked tokens are released during dmabuf unbind.
>
> The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE
> attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per
> binding is disallowed and is rejected by netlink. The system will be
> "locked" into the mode that the first binding is set to. It can only be
> changed again once there are zero bindings on the system.
>
> Disabling autorelease offers ~13% improvement in CPU utilization.
>
> Static branching is used to limit the system to one mode or the other.
>
> The xa_erase(&net_devmem_dmabuf_bindings, ...) call is moved into
> __net_devmem_dmabuf_binding_free(...). The result is that it becomes
> possible to switch static branches atomically with regards to xarray
> state. In the time window between unbind and free the socket layer can
> still find the binding in the xarray, but it will fail to acquire
> binding->ref (if unbind decremented to zero). This change preserves
> correct behavior and allows us to avoid more complicated counting
> schemes for bindings.
>
> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> ---
> Changes in v10:
> - add binding->users to track socket and rxq users of binding, defer
> release of urefs until binding->users hits zero to guard users from
> incrementing urefs *after* net_devmem_dmabuf_binding_put_urefs()
> is called. (Mina)
> - fix error path failing to restore static key state when xarray alloc
>   fails (Jakub)
> - add wrappers for setting/unsetting mode that captures the static key +
> rx binding count logic.
> - move xa_erase() into __net_devmem_dmabuf_binding_free()
> - remove net_devmem_rx_bindings_count, change xarray management to
>   avoid the same race that net_devmem_rx_bindings_count did
> - check return of net_devmem_dmabuf_binding_get() in
> tcp_recvmsg_dmabuf()
> - move sk_devmem_info.binding fiddling into autorelease=off static path
>
> Changes in v9:
> - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n
> - Add wrapper around tcp_devmem_ar_key accesses so that it may be
> stubbed out when NET_DEVMEM=n
> - only dec rx binding count for rx bindings in free (v8 did not exclude
> TX bindings)
>
> Changes in v8:
> - Only reset static key when bindings go to zero, defaulting back to
> disabled (Stan).
> - Fix bad usage of xarray spinlock for sleepy static branch switching,
> use mutex instead.
> - Access pp_ref_count via niov->desc instead of niov directly.
> - Move reset of static key to __net_devmem_dmabuf_binding_free() so that
> the static key can not be changed while there are outstanding tokens
> (free is only called when reference count reaches zero).
> - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active
> even after xa_erase(), so static key changes must wait until all
> RX bindings are finally freed (not just when xarray is empty). A
> counter is a simple way to track this.
> - socket takes reference on the binding, to avoid use-after-free on
> sk_devmem_info.binding in the case that user releases all tokens,
> unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token).
> - removed some comments that were unnecessary
>
> Changes in v7:
> - implement autorelease with static branch (Stan)
> - use netlink instead of sockopt (Stan)
> - merge uAPI and implementation patches into one patch (seemed less
> confusing)
>
> Changes in v6:
> - remove sk_devmem_info.autorelease, using binding->autorelease instead
> - move binding->autorelease check to outside of
> net_devmem_dmabuf_binding_put_urefs() (Mina)
> - remove overly defensive net_is_devmem_iov() (Mina)
> - add comment about multiple urefs mapping to a single netmem ref (Mina)
> - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina)
> - use niov without casting back and forth with netmem (Mina)
> - move the autorelease flag from per-binding to per-socket (Mina)
> - remove the batching logic in sock_devmem_dontneed_manual_release()
> (Mina)
> - move autorelease check inside tcp_xa_pool_commit() (Mina)
> - remove single-binding restriction for autorelease mode (Mina)
> - unbind always checks for leaked urefs
>
> Changes in v5:
> - remove unused variables
> - introduce autorelease flag, preparing for future patch toggle new
> behavior
>
> Changes in v3:
> - make urefs per-binding instead of per-socket, reducing memory
> footprint
> - fallback to cleaning up references in dmabuf unbind if socket leaked
> tokens
> - drop ethtool patch
>
> Changes in v2:
> - always use GFP_ZERO for binding->vec (Mina)
> - remove WARN for changed binding (Mina)
> - remove extraneous binding ref get (Mina)
> - remove WARNs on invalid user input (Mina)
> - pre-assign niovs in binding->vec for RX case (Mina)
> - use atomic_set(, 0) to initialize sk_user_frags.urefs
> - fix length of alloc for urefs
> ---
> Documentation/netlink/specs/netdev.yaml | 12 +++
> include/net/netmem.h | 1 +
> include/net/sock.h | 7 +-
> include/uapi/linux/netdev.h | 1 +
> net/core/devmem.c | 136 +++++++++++++++++++++++++++-----
> net/core/devmem.h | 64 ++++++++++++++-
> net/core/netdev-genl-gen.c | 5 +-
> net/core/netdev-genl.c | 10 ++-
> net/core/sock.c | 57 +++++++++++--
> net/ipv4/tcp.c | 87 ++++++++++++++++----
> net/ipv4/tcp_ipv4.c | 15 +++-
> net/ipv4/tcp_minisocks.c | 3 +-
> tools/include/uapi/linux/netdev.h | 1 +
> 13 files changed, 345 insertions(+), 54 deletions(-)
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 596c306ce52b..a5301b150663 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -562,6 +562,17 @@ attribute-sets:
> type: u32
> checks:
> min: 1
> + -
> + name: autorelease
> + doc: |
> + Token autorelease mode. If true (1), leaked tokens are automatically
> + released when the socket closes. If false (0), leaked tokens are only
> + released when the dmabuf is torn down. Once a binding is created with
> + a specific mode, all subsequent bindings system-wide must use the
> + same mode.
> +
> + Optional. Defaults to false if not specified.
Ooof. Defaults to false if not specified.
My anxiety with this patch is that running an actual ML training job
involves many layers of middleware where we may not be able to enforce
the "must dontneed before closing the socket" requirement. I'm
curious if you have ML jobs or NCCL tests running on this and passing?
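To make the difference between the two modes concrete, here is a toy Python model of the token accounting as I understand it from the patch. All names (Binding, Socket, unbind) are illustrative, not kernel code: with autorelease on, each outstanding token pins one page-pool reference that is dropped at dontneed or socket close; with autorelease off, multiple urefs collapse onto a single page-pool reference per niov, and leaked urefs are only drained at unbind.

```python
# Toy model of the two devmem token-release modes. Illustrative only;
# the real accounting lives in net/core/devmem.c and net/core/sock.c.

class Binding:
    def __init__(self, num_pages):
        self.urefs = [0] * num_pages   # per-niov user reference counts
        self.pp_refs = 0               # outstanding page-pool references

class Socket:
    def __init__(self, binding, autorelease):
        self.binding = binding
        self.autorelease = autorelease
        self.tokens = set()            # tokens handed out to userspace

    def recv_frag(self, token):
        b = self.binding
        if self.autorelease:
            # one page-pool ref per outstanding token
            b.pp_refs += 1
            self.tokens.add(token)
        else:
            # multiple urefs map to a single page-pool ref per niov
            b.urefs[token] += 1
            if b.urefs[token] == 1:
                b.pp_refs += 1

    def dontneed(self, token):
        b = self.binding
        if self.autorelease:
            if token in self.tokens:
                self.tokens.remove(token)
                b.pp_refs -= 1
        else:
            b.urefs[token] -= 1
            if b.urefs[token] == 0:
                b.pp_refs -= 1

    def close(self):
        if self.autorelease:
            # leaked tokens are released when the socket closes
            self.binding.pp_refs -= len(self.tokens)
            self.tokens.clear()
        # with autorelease off, leaked urefs survive socket close
        # and are drained at unbind instead

def unbind(binding):
    # drain any urefs the application leaked (autorelease=off path)
    for i, n in enumerate(binding.urefs):
        if n > 0:
            binding.urefs[i] = 0
            binding.pp_refs -= 1
```

In this model the "dontneed before close" requirement only bites in the autorelease=off case: a token leaked past close() keeps its page-pool reference alive until the dmabuf is unbound.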
> + type: u8
>
> operations:
> list:
> @@ -769,6 +780,7 @@ operations:
> - ifindex
> - fd
> - queues
> + - autorelease
> reply:
> attributes:
> - id
> diff --git a/include/net/netmem.h b/include/net/netmem.h
> index 9e10f4ac50c3..80d2263ba4ed 100644
> --- a/include/net/netmem.h
> +++ b/include/net/netmem.h
> @@ -112,6 +112,7 @@ struct net_iov {
> };
> struct net_iov_area *owner;
> enum net_iov_type type;
> + atomic_t uref;
> };
>
> struct net_iov_area {
> diff --git a/include/net/sock.h b/include/net/sock.h
> index aafe8bdb2c0f..9d3d5bde15e9 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -352,7 +352,7 @@ struct sk_filter;
> * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
> * @sk_scm_unused: unused flags for scm_recv()
> * @ns_tracker: tracker for netns reference
> - * @sk_user_frags: xarray of pages the user is holding a reference on.
> + * @sk_devmem_info: the devmem binding information for the socket
> * @sk_owner: reference to the real owner of the socket that calls
> * sock_lock_init_class_and_name().
> */
> @@ -584,7 +584,10 @@ struct sock {
> struct numa_drop_counters *sk_drop_counters;
> struct rcu_head sk_rcu;
> netns_tracker ns_tracker;
> - struct xarray sk_user_frags;
> + struct {
> + struct xarray frags;
> + struct net_devmem_dmabuf_binding *binding;
> + } sk_devmem_info;
>
> #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
> struct module *sk_owner;
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index e0b579a1df4f..1e5c209cb998 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -207,6 +207,7 @@ enum {
> NETDEV_A_DMABUF_QUEUES,
> NETDEV_A_DMABUF_FD,
> NETDEV_A_DMABUF_ID,
> + NETDEV_A_DMABUF_AUTORELEASE,
>
> __NETDEV_A_DMABUF_MAX,
> NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 9dee697a28ee..1264d8ee40e3 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -11,6 +11,7 @@
> #include <linux/genalloc.h>
> #include <linux/mm.h>
> #include <linux/netdevice.h>
> +#include <linux/skbuff_ref.h>
> #include <linux/types.h>
> #include <net/netdev_queues.h>
> #include <net/netdev_rx_queue.h>
> @@ -27,6 +28,9 @@
> /* Device memory support */
>
> static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
> +static DEFINE_MUTEX(devmem_ar_lock);
> +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> +EXPORT_SYMBOL(tcp_devmem_ar_key);
>
> static const struct memory_provider_ops dmabuf_devmem_ops;
>
> @@ -63,12 +67,71 @@ static void net_devmem_dmabuf_binding_release(struct percpu_ref *ref)
> schedule_work(&binding->unbind_w);
> }
>
> +static bool net_devmem_has_rx_bindings(void)
> +{
> + struct net_devmem_dmabuf_binding *binding;
> + unsigned long index;
> +
> + lockdep_assert_held(&devmem_ar_lock);
> +
> + xa_for_each(&net_devmem_dmabuf_bindings, index, binding) {
> + if (binding->direction == DMA_FROM_DEVICE)
> + return true;
> + }
> + return false;
> +}
> +
> +/* caller must hold devmem_ar_lock */
> +static int
> +__net_devmem_dmabuf_binding_set_mode(enum dma_data_direction direction,
> + bool autorelease)
> +{
> + lockdep_assert_held(&devmem_ar_lock);
> +
> + if (direction != DMA_FROM_DEVICE)
> + return 0;
> +
> + if (net_devmem_has_rx_bindings() &&
> + static_key_enabled(&tcp_devmem_ar_key) != autorelease)
> + return -EBUSY;
> +
> + if (autorelease)
> + static_branch_enable(&tcp_devmem_ar_key);
> +
> + return 0;
> +}
> +
> +/* caller must hold devmem_ar_lock */
> +static void
> +__net_devmem_dmabuf_binding_unset_mode(enum dma_data_direction direction)
> +{
> + lockdep_assert_held(&devmem_ar_lock);
> +
> + if (direction != DMA_FROM_DEVICE)
> + return;
> +
> + if (net_devmem_has_rx_bindings())
> + return;
> +
> + static_branch_disable(&tcp_devmem_ar_key);
> +}
> +
> void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
> {
> struct net_devmem_dmabuf_binding *binding = container_of(wq, typeof(*binding), unbind_w);
>
> size_t size, avail;
>
> + mutex_lock(&devmem_ar_lock);
> + xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> + __net_devmem_dmabuf_binding_unset_mode(binding->direction);
> + mutex_unlock(&devmem_ar_lock);
> +
> + /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> + * erase.
> + */
> + synchronize_net();
> +
I'm sorry, I could not wrap my head around moving this block of code to
the _free(), even though you mention it in the commit message. Why does
removing the binding from net_devmem_dmabuf_bindings on unbind not
work for you? If the binding is not active on any rx queues then there
cannot be new data received on it. Unless I'm missing something it
should be fine to leave this where it is.
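For what it's worth, the behavior the commit message relies on in that window is the usual kill/tryget-live pattern: after unbind kills the ref but before _free() erases the xarray entry, a lookup can still find the binding, but acquiring binding->ref must fail. A toy Python model of that pattern (illustrative only, standing in for percpu_ref semantics, not the kernel implementation):

```python
import threading

class Ref:
    """Toy kill/tryget-live refcount: after kill(), lookups may still
    find the object, but tryget() must fail so no new user appears."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 1        # initial reference held by the binding
        self._dead = False

    def kill(self):
        # drop the initial ref and mark dying; True means free now
        with self._lock:
            self._dead = True
            self._count -= 1
            return self._count == 0

    def tryget(self):
        # fails once dying, even if other refs are still outstanding
        with self._lock:
            if self._dead or self._count == 0:
                return False
            self._count += 1
            return True

    def put(self):
        # True means the last ref of a dying object dropped: free now
        with self._lock:
            self._count -= 1
            return self._count == 0 and self._dead
```

So whichever function holds the xa_erase(), a racing socket that finds the binding after kill() gets a failed tryget() and backs off; the erase placement only changes how long the stale lookup window is, not correctness.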
> gen_pool_for_each_chunk(binding->chunk_pool,
> net_devmem_dmabuf_free_chunk_owner, NULL);
>
> @@ -126,19 +189,30 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
> gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
> }
>
> +void
> +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> +{
> + int i;
> +
> + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
> + struct net_iov *niov;
> + netmem_ref netmem;
> +
> + niov = binding->vec[i];
> + netmem = net_iov_to_netmem(niov);
> +
> + /* Multiple urefs map to only a single netmem ref. */
> + if (atomic_xchg(&niov->uref, 0) > 0)
> + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> + }
> +}
> +
> void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> {
> struct netdev_rx_queue *rxq;
> unsigned long xa_idx;
> unsigned int rxq_idx;
>
> - xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> -
> - /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> - * erase.
> - */
> - synchronize_net();
> -
> if (binding->list.next)
> list_del(&binding->list);
>
> @@ -151,6 +225,8 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> rxq_idx = get_netdev_rx_queue_index(rxq);
>
> __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
> +
> + net_devmem_dmabuf_binding_user_put(binding);
> }
>
> percpu_ref_kill(&binding->ref);
> @@ -178,6 +254,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> if (err)
> goto err_close_rxq;
>
> + atomic_inc(&binding->users);
> +
nit: feels wrong that we have _binding_user_put() but open code the
get(). Either open code both or add helpers for both, maybe?
> return 0;
>
> err_close_rxq:
> @@ -189,8 +267,10 @@ struct net_devmem_dmabuf_binding *
> net_devmem_bind_dmabuf(struct net_device *dev,
> struct device *dma_dev,
> enum dma_data_direction direction,
> - unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> - struct netlink_ext_ack *extack)
> + unsigned int dmabuf_fd,
> + struct netdev_nl_sock *priv,
> + struct netlink_ext_ack *extack,
> + bool autorelease)
> {
> struct net_devmem_dmabuf_binding *binding;
> static u32 id_alloc_next;
> @@ -225,6 +305,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> if (err < 0)
> goto err_free_binding;
>
> + atomic_set(&binding->users, 0);
> +
> mutex_init(&binding->lock);
>
> binding->dmabuf = dmabuf;
> @@ -245,14 +327,12 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> goto err_detach;
> }
>
> - if (direction == DMA_TO_DEVICE) {
> - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> - sizeof(struct net_iov *),
> - GFP_KERNEL);
> - if (!binding->vec) {
> - err = -ENOMEM;
> - goto err_unmap;
> - }
> + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> + sizeof(struct net_iov *),
> + GFP_KERNEL | __GFP_ZERO);
> + if (!binding->vec) {
> + err = -ENOMEM;
> + goto err_unmap;
> }
>
> /* For simplicity we expect to make PAGE_SIZE allocations, but the
> @@ -306,25 +386,41 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> niov = &owner->area.niovs[i];
> niov->type = NET_IOV_DMABUF;
> niov->owner = &owner->area;
> + atomic_set(&niov->uref, 0);
> page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
> net_devmem_get_dma_addr(niov));
> - if (direction == DMA_TO_DEVICE)
> - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> }
>
> virtual += len;
> }
>
> + mutex_lock(&devmem_ar_lock);
> +
> + err = __net_devmem_dmabuf_binding_set_mode(direction, autorelease);
> + if (err < 0) {
> + NL_SET_ERR_MSG_FMT(extack,
> + "System already configured with autorelease=%d",
> + static_key_enabled(&tcp_devmem_ar_key));
> + goto err_unlock_mutex;
> + }
> +
Unless I've misread something, this looks very incorrect. TX bindings
will accidentally set the system to autorelease=false mode, no? You
need to make sure you set the mode for RX bindings only, right?
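For reference, here is a toy model of the mode locking the quoted helpers appear to implement (Python, illustrative only; the direction check, an explicit counter standing in for the xarray walk, and the string "EBUSY" are all simplifications): non-RX directions return early, the first RX binding locks the system-wide key, and the key only resets once RX bindings reach zero.

```python
# Toy model of __net_devmem_dmabuf_binding_set_mode()/_unset_mode().
# Illustrative only: the patch walks the bindings xarray instead of
# keeping a counter, and uses a real static key plus -EBUSY.

DMA_FROM_DEVICE = "rx"
DMA_TO_DEVICE = "tx"

class ModeState:
    def __init__(self):
        self.key_enabled = False   # models the tcp_devmem_ar_key static key
        self.rx_bindings = 0       # models "are there RX bindings?"

    def set_mode(self, direction, autorelease):
        if direction != DMA_FROM_DEVICE:
            return 0                       # TX bindings never touch the key
        if self.rx_bindings and self.key_enabled != autorelease:
            return "EBUSY"                 # system already locked to a mode
        if autorelease:
            self.key_enabled = True
        self.rx_bindings += 1
        return 0

    def unset_mode(self, direction):
        if direction != DMA_FROM_DEVICE:
            return
        self.rx_bindings -= 1
        if self.rx_bindings == 0:
            self.key_enabled = False       # mode can change again at zero
```

In this reading a TX bind is a no-op for the key either way; the questionable part is only whether the real helpers are called with the right direction at every call site.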
> err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
> binding, xa_limit_32b, &id_alloc_next,
> GFP_KERNEL);
> if (err < 0)
> - goto err_free_chunks;
> + goto err_unset_mode;
> +
> + mutex_unlock(&devmem_ar_lock);
>
> list_add(&binding->list, &priv->bindings);
>
> return binding;
>
> +err_unset_mode:
> + __net_devmem_dmabuf_binding_unset_mode(direction);
> +err_unlock_mutex:
> + mutex_unlock(&devmem_ar_lock);
> err_free_chunks:
> gen_pool_for_each_chunk(binding->chunk_pool,
> net_devmem_dmabuf_free_chunk_owner, NULL);
> diff --git a/net/core/devmem.h b/net/core/devmem.h
> index 94874b323520..284f0ad5f381 100644
> --- a/net/core/devmem.h
> +++ b/net/core/devmem.h
> @@ -12,9 +12,13 @@
>
> #include <net/netmem.h>
> #include <net/netdev_netlink.h>
> +#include <linux/jump_label.h>
>
> struct netlink_ext_ack;
>
> +/* static key for TCP devmem autorelease */
> +extern struct static_key_false tcp_devmem_ar_key;
> +
> struct net_devmem_dmabuf_binding {
> struct dma_buf *dmabuf;
> struct dma_buf_attachment *attachment;
> @@ -43,6 +47,12 @@ struct net_devmem_dmabuf_binding {
> */
> struct percpu_ref ref;
>
> + /* Counts sockets and rxqs that are using the binding. When this
> + * reaches zero, all urefs are drained and new sockets cannot join the
> + * binding.
> + */
> + atomic_t users;
> +
> /* The list of bindings currently active. Used for netlink to notify us
> * of the user dropping the bind.
> */
> @@ -61,7 +71,7 @@ struct net_devmem_dmabuf_binding {
>
> /* Array of net_iov pointers for this binding, sorted by virtual
> * address. This array is convenient to map the virtual addresses to
> - * net_iovs in the TX path.
> + * net_iovs.
> */
> struct net_iov **vec;
>
> @@ -88,7 +98,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> struct device *dma_dev,
> enum dma_data_direction direction,
> unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> - struct netlink_ext_ack *extack);
> + struct netlink_ext_ack *extack, bool autorelease);
> struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id);
> void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
> int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> @@ -134,6 +144,26 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
> percpu_ref_put(&binding->ref);
> }
>
> +void net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding);
> +
> +static inline bool
> +net_devmem_dmabuf_binding_user_get(struct net_devmem_dmabuf_binding *binding)
> +{
> + return atomic_inc_not_zero(&binding->users);
> +}
> +
> +static inline void
> +net_devmem_dmabuf_binding_user_put(struct net_devmem_dmabuf_binding *binding)
> +{
> + if (atomic_dec_and_test(&binding->users))
> + net_devmem_dmabuf_binding_put_urefs(binding);
> +}
> +
> +static inline bool net_devmem_autorelease_enabled(void)
> +{
> + return static_branch_unlikely(&tcp_devmem_ar_key);
> +}
> +
> void net_devmem_get_net_iov(struct net_iov *niov);
> void net_devmem_put_net_iov(struct net_iov *niov);
>
> @@ -151,11 +181,38 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr,
> #else
> struct net_devmem_dmabuf_binding;
>
> +static inline bool
> +net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
> +{
> + return false;
> +}
> +
> static inline void
> net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
> {
> }
>
> +static inline void
> +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> +{
> +}
> +
> +static inline bool
> +net_devmem_dmabuf_binding_user_get(struct net_devmem_dmabuf_binding *binding)
> +{
> + return false;
> +}
> +
> +static inline void
> +net_devmem_dmabuf_binding_user_put(struct net_devmem_dmabuf_binding *binding)
> +{
> +}
> +
> +static inline bool net_devmem_autorelease_enabled(void)
> +{
> + return false;
> +}
> +
> static inline void net_devmem_get_net_iov(struct net_iov *niov)
> {
> }
> @@ -170,7 +227,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> enum dma_data_direction direction,
> unsigned int dmabuf_fd,
> struct netdev_nl_sock *priv,
> - struct netlink_ext_ack *extack)
> + struct netlink_ext_ack *extack,
> + bool autorelease)
> {
> return ERR_PTR(-EOPNOTSUPP);
> }
> diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> index ba673e81716f..01b7765e11ec 100644
> --- a/net/core/netdev-genl-gen.c
> +++ b/net/core/netdev-genl-gen.c
> @@ -86,10 +86,11 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
> };
>
> /* NETDEV_CMD_BIND_RX - do */
> -static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = {
> +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_AUTORELEASE + 1] = {
> [NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
> [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
> [NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy),
> + [NETDEV_A_DMABUF_AUTORELEASE] = { .type = NLA_U8, },
This patch in the series is proving complicated to review. Any changes
you can refactor out of it for ease of review would be very welcome.
Things like the variable renames, netlink API changes, and any
refactors that can be pulled into their own patches would ease review
of the core functionality.
> };
>
> /* NETDEV_CMD_NAPI_SET - do */
> @@ -188,7 +189,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
> .cmd = NETDEV_CMD_BIND_RX,
> .doit = netdev_nl_bind_rx_doit,
> .policy = netdev_bind_rx_nl_policy,
> - .maxattr = NETDEV_A_DMABUF_FD,
> + .maxattr = NETDEV_A_DMABUF_AUTORELEASE,
> .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> },
> {
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index 470fabbeacd9..c742bb34865e 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -939,6 +939,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> struct netdev_nl_sock *priv;
> struct net_device *netdev;
> unsigned long *rxq_bitmap;
> + bool autorelease = false;
> struct device *dma_dev;
> struct sk_buff *rsp;
> int err = 0;
> @@ -952,6 +953,10 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
> dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]);
>
> + if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE])
> + autorelease =
> + !!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]);
> +
> priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
> if (IS_ERR(priv))
> return PTR_ERR(priv);
> @@ -1002,7 +1007,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> }
>
> binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE,
> - dmabuf_fd, priv, info->extack);
> + dmabuf_fd, priv, info->extack,
> + autorelease);
> if (IS_ERR(binding)) {
> err = PTR_ERR(binding);
> goto err_rxq_bitmap;
> @@ -1097,7 +1103,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
>
> dma_dev = netdev_queue_get_dma_dev(netdev, 0);
> binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE,
> - dmabuf_fd, priv, info->extack);
> + dmabuf_fd, priv, info->extack, false);
> if (IS_ERR(binding)) {
> err = PTR_ERR(binding);
> goto err_unlock_netdev;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index f6526f43aa6e..6355c2ccfb8a 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -87,6 +87,7 @@
>
> #include <linux/unaligned.h>
> #include <linux/capability.h>
> +#include <linux/dma-buf.h>
> #include <linux/errno.h>
> #include <linux/errqueue.h>
> #include <linux/types.h>
> @@ -151,6 +152,7 @@
> #include <uapi/linux/pidfd.h>
>
> #include "dev.h"
> +#include "devmem.h"
>
> static DEFINE_MUTEX(proto_list_mutex);
> static LIST_HEAD(proto_list);
> @@ -1081,6 +1083,44 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> #define MAX_DONTNEED_TOKENS 128
> #define MAX_DONTNEED_FRAGS 1024
>
> +static noinline_for_stack int
> +sock_devmem_dontneed_manual_release(struct sock *sk,
> + struct dmabuf_token *tokens,
> + unsigned int num_tokens)
> +{
> + struct net_iov *niov;
> + unsigned int i, j;
> + netmem_ref netmem;
> + unsigned int token;
> + int num_frags = 0;
> + int ret = 0;
> +
> + if (!sk->sk_devmem_info.binding)
> + return -EINVAL;
> +
> + for (i = 0; i < num_tokens; i++) {
> + for (j = 0; j < tokens[i].token_count; j++) {
> + size_t size = sk->sk_devmem_info.binding->dmabuf->size;
> +
> + token = tokens[i].token_start + j;
> + if (token >= size / PAGE_SIZE)
> + break;
> +
> + if (++num_frags > MAX_DONTNEED_FRAGS)
> + return ret;
> +
> + niov = sk->sk_devmem_info.binding->vec[token];
> + if (atomic_dec_and_test(&niov->uref)) {
> + netmem = net_iov_to_netmem(niov);
> + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> + }
> + ret++;
> + }
> + }
> +
> + return ret;
> +}
> +
> static noinline_for_stack int
> sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> unsigned int num_tokens)
> @@ -1089,32 +1129,33 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> int ret = 0, num_frags = 0;
> netmem_ref netmems[16];
>
> - xa_lock_bh(&sk->sk_user_frags);
> + xa_lock_bh(&sk->sk_devmem_info.frags);
> for (i = 0; i < num_tokens; i++) {
> for (j = 0; j < tokens[i].token_count; j++) {
> if (++num_frags > MAX_DONTNEED_FRAGS)
> goto frag_limit_reached;
>
> netmem_ref netmem = (__force netmem_ref)__xa_erase(
> - &sk->sk_user_frags, tokens[i].token_start + j);
> + &sk->sk_devmem_info.frags,
> + tokens[i].token_start + j);
>
> if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> continue;
>
> netmems[netmem_num++] = netmem;
> if (netmem_num == ARRAY_SIZE(netmems)) {
> - xa_unlock_bh(&sk->sk_user_frags);
> + xa_unlock_bh(&sk->sk_devmem_info.frags);
> for (k = 0; k < netmem_num; k++)
> WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> netmem_num = 0;
> - xa_lock_bh(&sk->sk_user_frags);
> + xa_lock_bh(&sk->sk_devmem_info.frags);
> }
> ret++;
> }
> }
>
> frag_limit_reached:
> - xa_unlock_bh(&sk->sk_user_frags);
> + xa_unlock_bh(&sk->sk_devmem_info.frags);
> for (k = 0; k < netmem_num; k++)
> WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
>
> @@ -1145,7 +1186,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> return -EFAULT;
> }
>
> - ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
> + if (net_devmem_autorelease_enabled())
> + ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
> + else
> + ret = sock_devmem_dontneed_manual_release(sk, tokens,
> + num_tokens);
>
> kvfree(tokens);
> return ret;
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index d5319ebe2452..73a577bd8765 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -260,6 +260,7 @@
> #include <linux/memblock.h>
> #include <linux/highmem.h>
> #include <linux/cache.h>
> +#include <linux/dma-buf.h>
> #include <linux/err.h>
> #include <linux/time.h>
> #include <linux/slab.h>
> @@ -492,7 +493,8 @@ void tcp_init_sock(struct sock *sk)
>
> set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
> sk_sockets_allocated_inc(sk);
> - xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1);
> + xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
> + sk->sk_devmem_info.binding = NULL;
> }
> EXPORT_IPV6_MOD(tcp_init_sock);
>
> @@ -2424,11 +2426,12 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
>
> /* Commit part that has been copied to user space. */
> for (i = 0; i < p->idx; i++)
> - __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
> - (__force void *)p->netmems[i], GFP_KERNEL);
> + __xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i],
> + XA_ZERO_ENTRY, (__force void *)p->netmems[i],
> + GFP_KERNEL);
> /* Rollback what has been pre-allocated and is no longer needed. */
> for (; i < p->max; i++)
> - __xa_erase(&sk->sk_user_frags, p->tokens[i]);
> + __xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]);
>
> p->max = 0;
> p->idx = 0;
> @@ -2436,14 +2439,17 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
>
> static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p)
> {
> + if (!net_devmem_autorelease_enabled())
> + return;
> +
> if (!p->max)
> return;
>
> - xa_lock_bh(&sk->sk_user_frags);
> + xa_lock_bh(&sk->sk_devmem_info.frags);
>
> tcp_xa_pool_commit_locked(sk, p);
>
> - xa_unlock_bh(&sk->sk_user_frags);
> + xa_unlock_bh(&sk->sk_devmem_info.frags);
> }
>
> static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> @@ -2454,24 +2460,41 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> if (p->idx < p->max)
> return 0;
>
> - xa_lock_bh(&sk->sk_user_frags);
> + xa_lock_bh(&sk->sk_devmem_info.frags);
>
> tcp_xa_pool_commit_locked(sk, p);
>
> for (k = 0; k < max_frags; k++) {
> - err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
> + err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k],
> XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
> if (err)
> break;
> }
>
> - xa_unlock_bh(&sk->sk_user_frags);
> + xa_unlock_bh(&sk->sk_devmem_info.frags);
>
> p->max = k;
> p->idx = 0;
> return k ? 0 : err;
> }
>
> +static void tcp_xa_pool_inc_pp_ref_count(struct tcp_xa_pool *tcp_xa_pool,
> + skb_frag_t *frag)
> +{
> + struct net_iov *niov;
> +
> + niov = skb_frag_net_iov(frag);
> +
> + if (net_devmem_autorelease_enabled()) {
> + atomic_long_inc(&niov->desc.pp_ref_count);
> + tcp_xa_pool->netmems[tcp_xa_pool->idx++] =
> + skb_frag_netmem(frag);
> + } else {
> + if (atomic_inc_return(&niov->uref) == 1)
> + atomic_long_inc(&niov->desc.pp_ref_count);
> + }
> +}
> +
> /* On error, returns the -errno. On success, returns number of bytes sent to the
> * user. May not consume all of @remaining_len.
> */
> @@ -2533,6 +2556,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> * sequence of cmsg
> */
> for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> + struct net_devmem_dmabuf_binding *binding = NULL;
> skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> struct net_iov *niov;
> u64 frag_offset;
> @@ -2568,13 +2592,45 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> start;
> dmabuf_cmsg.frag_offset = frag_offset;
> dmabuf_cmsg.frag_size = copy;
> - err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
> - skb_shinfo(skb)->nr_frags - i);
> - if (err)
> - goto out;
> +
> + binding = net_devmem_iov_binding(niov);
> +
> + if (net_devmem_autorelease_enabled()) {
> + err = tcp_xa_pool_refill(sk,
> + &tcp_xa_pool,
> + skb_shinfo(skb)->nr_frags - i);
> + if (err)
> + goto out;
> +
> + dmabuf_cmsg.frag_token =
> + tcp_xa_pool.tokens[tcp_xa_pool.idx];
> + } else {
> + if (!sk->sk_devmem_info.binding) {
> + if (!net_devmem_dmabuf_binding_user_get(binding)) {
> + err = -ENODEV;
> + goto out;
> + }
> +
> + if (!net_devmem_dmabuf_binding_get(binding)) {
> + net_devmem_dmabuf_binding_user_put(binding);
> + err = -ENODEV;
> + goto out;
> + }
> +
> + sk->sk_devmem_info.binding = binding;
> + }
> +
> + if (sk->sk_devmem_info.binding != binding) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + dmabuf_cmsg.frag_token =
> + net_iov_virtual_addr(niov) >> PAGE_SHIFT;
> + }
> +
>
> /* Will perform the exchange later */
> - dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
> dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
>
> offset += copy;
> @@ -2587,8 +2643,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> if (err)
> goto out;
>
> - atomic_long_inc(&niov->desc.pp_ref_count);
> - tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
> + tcp_xa_pool_inc_pp_ref_count(&tcp_xa_pool, frag);
>
> sent += copy;
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index f8a9596e8f4d..420e8c8ebf6d 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -89,6 +89,9 @@
>
> #include <crypto/md5.h>
>
> +#include <linux/dma-buf.h>
> +#include "../core/devmem.h"
> +
> #include <trace/events/tcp.h>
>
> #ifdef CONFIG_TCP_MD5SIG
> @@ -2492,7 +2495,7 @@ static void tcp_release_user_frags(struct sock *sk)
> unsigned long index;
> void *netmem;
>
> - xa_for_each(&sk->sk_user_frags, index, netmem)
> + xa_for_each(&sk->sk_devmem_info.frags, index, netmem)
> WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
> #endif
> }
> @@ -2503,7 +2506,15 @@ void tcp_v4_destroy_sock(struct sock *sk)
>
> tcp_release_user_frags(sk);
>
> - xa_destroy(&sk->sk_user_frags);
> + if (!net_devmem_autorelease_enabled() && sk->sk_devmem_info.binding) {
> + net_devmem_dmabuf_binding_user_put(sk->sk_devmem_info.binding);
> + net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding);
> + sk->sk_devmem_info.binding = NULL;
> + WARN_ONCE(!xa_empty(&sk->sk_devmem_info.frags),
> + "non-empty xarray discovered in autorelease off mode");
> + }
> +
> + xa_destroy(&sk->sk_devmem_info.frags);
>
> trace_tcp_destroy_sock(sk);
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index bd5462154f97..2aec977f5c12 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -662,7 +662,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>
> __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
>
> - xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
> + xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
> + newsk->sk_devmem_info.binding = NULL;
>
> return newsk;
> }
> diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> index e0b579a1df4f..1e5c209cb998 100644
> --- a/tools/include/uapi/linux/netdev.h
> +++ b/tools/include/uapi/linux/netdev.h
> @@ -207,6 +207,7 @@ enum {
> NETDEV_A_DMABUF_QUEUES,
> NETDEV_A_DMABUF_FD,
> NETDEV_A_DMABUF_ID,
> + NETDEV_A_DMABUF_AUTORELEASE,
>
> __NETDEV_A_DMABUF_MAX,
> NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
>
> --
> 2.47.3
>
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-21 1:07 ` [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Jakub Kicinski
2026-01-21 5:29 ` Bobby Eshleman
@ 2026-01-22 4:21 ` Mina Almasry
2026-01-26 18:45 ` Bobby Eshleman
1 sibling, 1 reply; 34+ messages in thread
From: Mina Almasry @ 2026-01-22 4:21 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Bobby Eshleman, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Tue, Jan 20, 2026 at 5:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 15 Jan 2026 21:02:11 -0800 Bobby Eshleman wrote:
> > This series improves the CPU cost of RX token management by adding an
> > attribute to NETDEV_CMD_BIND_RX that configures sockets using the
> > binding to avoid the xarray allocator and instead use a per-binding niov
> > array and a uref field in niov.
> >
> > Improvement is ~13% cpu util per RX user thread.
> >
> > Using kperf, the following results were observed:
> >
> > Before:
> > Average RX worker idle %: 13.13, flows 4, test runs 11
> > After:
> > Average RX worker idle %: 26.32, flows 4, test runs 11
> >
> > Two other approaches were tested, but with no improvement. Namely, 1)
> > using a hashmap for tokens and 2) keeping an xarray of atomic counters
> > but using RCU so that the hotpath could be mostly lockless. Neither of
> > these approaches proved better than the simple array in terms of CPU.
> >
> > The attribute NETDEV_A_DMABUF_AUTORELEASE is added to toggle the
> > optimization. It is an optional attribute and defaults to 0 (i.e.,
> > optimization on).
>
> IDK if the cmsg approach is still right for this flow TBH.
> IIRC when Stan talked about this a while back we were considering doing
> this via Netlink. Anything that proves that the user owns the binding
> would work. IIUC the TCP socket in this design just proves that socket
> has received a token from a given binding right?
Doesn't 'doing this via netlink' imply it's a control path operation
that acquires rtnl_lock, netdev_lock, or some other heavy lock, on the
expectation that you're making a config change? Returning tokens is a
data-path operation; IIRC we don't even lock the socket to do it in
the setsockopt.
Is there precedent/path to doing fast data-path operations via netlink?
There may be value in not biting off more than we can chew in one series.
Maybe an alternative non-setsockopt dontneeding scheme should be its
own patch series.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH net-next v10 3/5] net: devmem: implement autorelease token management
2026-01-22 4:15 ` Mina Almasry
@ 2026-01-22 5:18 ` Bobby Eshleman
0 siblings, 0 replies; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-22 5:18 UTC (permalink / raw)
To: Mina Almasry
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Wed, Jan 21, 2026 at 08:15:23PM -0800, Mina Almasry wrote:
> On Thu, Jan 15, 2026 at 9:03 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> >
> > From: Bobby Eshleman <bobbyeshleman@meta.com>
> >
> > Add support for autorelease toggling of tokens using a static branch to
> > control system-wide behavior. This allows applications to choose between
> > two memory management modes:
> >
> > 1. Autorelease on: Leaked tokens are automatically released when the
> > socket closes.
> >
> > 2. Autorelease off: Leaked tokens are released during dmabuf unbind.
> >
> > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE
> > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per
> > binding is disallowed and is rejected by netlink. The system will be
> > "locked" into the mode that the first binding is set to. It can only be
> > changed again once there are zero bindings on the system.
> >
> > Disabling autorelease offers ~13% improvement in CPU utilization.
> >
> > Static branching is used to limit the system to one mode or the other.
> >
> > The xa_erase(&net_devmem_dmabuf_bindings, ...) call is moved into
> > __net_devmem_dmabuf_binding_free(...). The result is that it becomes
> > possible to switch static branches atomically with regards to xarray
> > state. In the time window between unbind and free the socket layer can
> > still find the binding in the xarray, but it will fail to acquire
> > binding->ref (if unbind decremented to zero). This change preserves
> > correct behavior and allows us to avoid more complicated counting
> > schemes for bindings.
> >
> > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> > ---
> > Changes in v10:
> > - add binding->users to track socket and rxq users of binding, defer
> > release of urefs until binding->users hits zero to guard users from
> > incrementing urefs *after* net_devmem_dmabuf_binding_put_urefs()
> > is called. (Mina)
> > - fix error failing to restore static key state when xarray alloc fails
> > (Jakub)
> > - add wrappers for setting/unsetting mode that captures the static key +
> > rx binding count logic.
> > - move xa_erase() into __net_devmem_dmabuf_binding_free()
> > - remove net_devmem_rx_bindings_count, change xarray management to be
> > to avoid the same race as net_devmem_rx_bindings_count did
> > - check return of net_devmem_dmabuf_binding_get() in
> > tcp_recvmsg_dmabuf()
> > - move sk_devmem_info.binding fiddling into autorelease=off static path
> >
> > Changes in v9:
> > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n
> > - Add wrapper around tcp_devmem_ar_key accesses so that it may be
> > stubbed out when NET_DEVMEM=n
> > - only dec rx binding count for rx bindings in free (v8 did not exclude
> > TX bindings)
> >
> > Changes in v8:
> > - Only reset static key when bindings go to zero, defaulting back to
> > disabled (Stan).
> > - Fix bad usage of xarray spinlock for sleepy static branch switching,
> > use mutex instead.
> > - Access pp_ref_count via niov->desc instead of niov directly.
> > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that
> > the static key can not be changed while there are outstanding tokens
> > (free is only called when reference count reaches zero).
> > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active
> > even after xa_erase(), so static key changes must wait until all
> > RX bindings are finally freed (not just when xarray is empty). A
> > counter is a simple way to track this.
> > - socket takes reference on the binding, to avoid use-after-free on
> > sk_devmem_info.binding in the case that user releases all tokens,
> > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token).
> > - removed some comments that were unnecessary
> >
> > Changes in v7:
> > - implement autorelease with static branch (Stan)
> > - use netlink instead of sockopt (Stan)
> > - merge uAPI and implementation patches into one patch (seemed less
> > confusing)
> >
> > Changes in v6:
> > - remove sk_devmem_info.autorelease, using binding->autorelease instead
> > - move binding->autorelease check to outside of
> > net_devmem_dmabuf_binding_put_urefs() (Mina)
> > - remove overly defensive net_is_devmem_iov() (Mina)
> > - add comment about multiple urefs mapping to a single netmem ref (Mina)
> > - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina)
> > - use niov without casting back and forth with netmem (Mina)
> > - move the autorelease flag from per-binding to per-socket (Mina)
> > - remove the batching logic in sock_devmem_dontneed_manual_release()
> > (Mina)
> > - move autorelease check inside tcp_xa_pool_commit() (Mina)
> > - remove single-binding restriction for autorelease mode (Mina)
> > - unbind always checks for leaked urefs
> >
> > Changes in v5:
> > - remove unused variables
> > - introduce autorelease flag, preparing for future patch toggle new
> > behavior
> >
> > Changes in v3:
> > - make urefs per-binding instead of per-socket, reducing memory
> > footprint
> > - fallback to cleaning up references in dmabuf unbind if socket leaked
> > tokens
> > - drop ethtool patch
> >
> > Changes in v2:
> > - always use GFP_ZERO for binding->vec (Mina)
> > - remove WARN for changed binding (Mina)
> > - remove extraneous binding ref get (Mina)
> > - remove WARNs on invalid user input (Mina)
> > - pre-assign niovs in binding->vec for RX case (Mina)
> > - use atomic_set(, 0) to initialize sk_user_frags.urefs
> > - fix length of alloc for urefs
> > ---
> > Documentation/netlink/specs/netdev.yaml | 12 +++
> > include/net/netmem.h | 1 +
> > include/net/sock.h | 7 +-
> > include/uapi/linux/netdev.h | 1 +
> > net/core/devmem.c | 136 +++++++++++++++++++++++++++-----
> > net/core/devmem.h | 64 ++++++++++++++-
> > net/core/netdev-genl-gen.c | 5 +-
> > net/core/netdev-genl.c | 10 ++-
> > net/core/sock.c | 57 +++++++++++--
> > net/ipv4/tcp.c | 87 ++++++++++++++++----
> > net/ipv4/tcp_ipv4.c | 15 +++-
> > net/ipv4/tcp_minisocks.c | 3 +-
> > tools/include/uapi/linux/netdev.h | 1 +
> > 13 files changed, 345 insertions(+), 54 deletions(-)
> >
> > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> > index 596c306ce52b..a5301b150663 100644
> > --- a/Documentation/netlink/specs/netdev.yaml
> > +++ b/Documentation/netlink/specs/netdev.yaml
> > @@ -562,6 +562,17 @@ attribute-sets:
> > type: u32
> > checks:
> > min: 1
> > + -
> > + name: autorelease
> > + doc: |
> > + Token autorelease mode. If true (1), leaked tokens are automatically
> > + released when the socket closes. If false (0), leaked tokens are only
> > + released when the dmabuf is torn down. Once a binding is created with
> > + a specific mode, all subsequent bindings system-wide must use the
> > + same mode.
> > +
> > + Optional. Defaults to false if not specified.
>
> Ooof. Defaults to false if not specified.
I thought we wanted to default to the optimization on?
>
> My anxiety with this patch is that running an actual ML training job
> involves many layers of middleware where we may not be able to enforce
> the "must don'tneeding before closing the socket" requirement. I'm
> curious if you have ML jobs or NCCL tests running on this and passing?
I can run NCCL on it, but haven't with this rev yet.
>
> > + type: u8
> >
> > operations:
> > list:
> > @@ -769,6 +780,7 @@ operations:
> > - ifindex
> > - fd
> > - queues
> > + - autorelease
> > reply:
> > attributes:
> > - id
> > diff --git a/include/net/netmem.h b/include/net/netmem.h
> > index 9e10f4ac50c3..80d2263ba4ed 100644
> > --- a/include/net/netmem.h
> > +++ b/include/net/netmem.h
> > @@ -112,6 +112,7 @@ struct net_iov {
> > };
> > struct net_iov_area *owner;
> > enum net_iov_type type;
> > + atomic_t uref;
> > };
> >
> > struct net_iov_area {
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index aafe8bdb2c0f..9d3d5bde15e9 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -352,7 +352,7 @@ struct sk_filter;
> > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
> > * @sk_scm_unused: unused flags for scm_recv()
> > * @ns_tracker: tracker for netns reference
> > - * @sk_user_frags: xarray of pages the user is holding a reference on.
> > + * @sk_devmem_info: the devmem binding information for the socket
> > * @sk_owner: reference to the real owner of the socket that calls
> > * sock_lock_init_class_and_name().
> > */
> > @@ -584,7 +584,10 @@ struct sock {
> > struct numa_drop_counters *sk_drop_counters;
> > struct rcu_head sk_rcu;
> > netns_tracker ns_tracker;
> > - struct xarray sk_user_frags;
> > + struct {
> > + struct xarray frags;
> > + struct net_devmem_dmabuf_binding *binding;
> > + } sk_devmem_info;
> >
> > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
> > struct module *sk_owner;
> > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> > index e0b579a1df4f..1e5c209cb998 100644
> > --- a/include/uapi/linux/netdev.h
> > +++ b/include/uapi/linux/netdev.h
> > @@ -207,6 +207,7 @@ enum {
> > NETDEV_A_DMABUF_QUEUES,
> > NETDEV_A_DMABUF_FD,
> > NETDEV_A_DMABUF_ID,
> > + NETDEV_A_DMABUF_AUTORELEASE,
> >
> > __NETDEV_A_DMABUF_MAX,
> > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 9dee697a28ee..1264d8ee40e3 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -11,6 +11,7 @@
> > #include <linux/genalloc.h>
> > #include <linux/mm.h>
> > #include <linux/netdevice.h>
> > +#include <linux/skbuff_ref.h>
> > #include <linux/types.h>
> > #include <net/netdev_queues.h>
> > #include <net/netdev_rx_queue.h>
> > @@ -27,6 +28,9 @@
> > /* Device memory support */
> >
> > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
> > +static DEFINE_MUTEX(devmem_ar_lock);
> > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> > +EXPORT_SYMBOL(tcp_devmem_ar_key);
> >
> > static const struct memory_provider_ops dmabuf_devmem_ops;
> >
> > @@ -63,12 +67,71 @@ static void net_devmem_dmabuf_binding_release(struct percpu_ref *ref)
> > schedule_work(&binding->unbind_w);
> > }
> >
> > +static bool net_devmem_has_rx_bindings(void)
> > +{
> > + struct net_devmem_dmabuf_binding *binding;
> > + unsigned long index;
> > +
> > + lockdep_assert_held(&devmem_ar_lock);
> > +
> > + xa_for_each(&net_devmem_dmabuf_bindings, index, binding) {
> > + if (binding->direction == DMA_FROM_DEVICE)
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/* caller must hold devmem_ar_lock */
> > +static int
> > +__net_devmem_dmabuf_binding_set_mode(enum dma_data_direction direction,
> > + bool autorelease)
> > +{
> > + lockdep_assert_held(&devmem_ar_lock);
> > +
> > + if (direction != DMA_FROM_DEVICE)
> > + return 0;
> > +
> > + if (net_devmem_has_rx_bindings() &&
> > + static_key_enabled(&tcp_devmem_ar_key) != autorelease)
> > + return -EBUSY;
> > +
> > + if (autorelease)
> > + static_branch_enable(&tcp_devmem_ar_key);
> > +
> > + return 0;
> > +}
> > +
> > +/* caller must hold devmem_ar_lock */
> > +static void
> > +__net_devmem_dmabuf_binding_unset_mode(enum dma_data_direction direction)
> > +{
> > + lockdep_assert_held(&devmem_ar_lock);
> > +
> > + if (direction != DMA_FROM_DEVICE)
> > + return;
> > +
> > + if (net_devmem_has_rx_bindings())
> > + return;
> > +
> > + static_branch_disable(&tcp_devmem_ar_key);
> > +}
> > +
> > void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
> > {
> > struct net_devmem_dmabuf_binding *binding = container_of(wq, typeof(*binding), unbind_w);
> >
> > size_t size, avail;
> >
> > + mutex_lock(&devmem_ar_lock);
> > + xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> > + __net_devmem_dmabuf_binding_unset_mode(binding->direction);
> > + mutex_unlock(&devmem_ar_lock);
> > +
> > + /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> > + * erase.
> > + */
> > + synchronize_net();
> > +
>
> I'm sorry I could not wrap my head around moving this block of code to
> the _free(), even though you mention it in the commit message. Why is
> removing the the binding from net_devmem_dmabuf_bindings on unbind not
> work for you? If the binding is not active on any rx queues then there
> cannot be new data received on it. Unless I'm missing something it
> should be fine to leave this where it is.
Hmmm... actually now that recvmsg gets a binding->ref, the race I was
trying to avoid can't happen (that is, dontneed using the wrong static
branch after unbind but before free). Thanks for catching that.
>
>
> > gen_pool_for_each_chunk(binding->chunk_pool,
> > net_devmem_dmabuf_free_chunk_owner, NULL);
> >
> > @@ -126,19 +189,30 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
> > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
> > }
> >
> > +void
> > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
> > + struct net_iov *niov;
> > + netmem_ref netmem;
> > +
> > + niov = binding->vec[i];
> > + netmem = net_iov_to_netmem(niov);
> > +
> > + /* Multiple urefs map to only a single netmem ref. */
> > + if (atomic_xchg(&niov->uref, 0) > 0)
> > + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> > + }
> > +}
> > +
> > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > {
> > struct netdev_rx_queue *rxq;
> > unsigned long xa_idx;
> > unsigned int rxq_idx;
> >
> > - xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> > -
> > - /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> > - * erase.
> > - */
> > - synchronize_net();
> > -
> > if (binding->list.next)
> > list_del(&binding->list);
> >
> > @@ -151,6 +225,8 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > rxq_idx = get_netdev_rx_queue_index(rxq);
> >
> > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
> > +
> > + net_devmem_dmabuf_binding_user_put(binding);
> > }
> >
> > percpu_ref_kill(&binding->ref);
> > @@ -178,6 +254,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> > if (err)
> > goto err_close_rxq;
> >
> > + atomic_inc(&binding->users);
> > +
>
> nit: feels wrong that we have _binding_user_put() but open code the
> get(). Either open code both or helper both, maybe?
I had the same thought, but the issue is that for the first queue
atomic_inc_not_zero() (the helper) will fail.
>
> > return 0;
> >
> > err_close_rxq:
> > @@ -189,8 +267,10 @@ struct net_devmem_dmabuf_binding *
> > net_devmem_bind_dmabuf(struct net_device *dev,
> > struct device *dma_dev,
> > enum dma_data_direction direction,
> > - unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> > - struct netlink_ext_ack *extack)
> > + unsigned int dmabuf_fd,
> > + struct netdev_nl_sock *priv,
> > + struct netlink_ext_ack *extack,
> > + bool autorelease)
> > {
> > struct net_devmem_dmabuf_binding *binding;
> > static u32 id_alloc_next;
> > @@ -225,6 +305,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > if (err < 0)
> > goto err_free_binding;
> >
> > + atomic_set(&binding->users, 0);
> > +
> > mutex_init(&binding->lock);
> >
> > binding->dmabuf = dmabuf;
> > @@ -245,14 +327,12 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > goto err_detach;
> > }
> >
> > - if (direction == DMA_TO_DEVICE) {
> > - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> > - sizeof(struct net_iov *),
> > - GFP_KERNEL);
> > - if (!binding->vec) {
> > - err = -ENOMEM;
> > - goto err_unmap;
> > - }
> > + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> > + sizeof(struct net_iov *),
> > + GFP_KERNEL | __GFP_ZERO);
> > + if (!binding->vec) {
> > + err = -ENOMEM;
> > + goto err_unmap;
> > }
> >
> > /* For simplicity we expect to make PAGE_SIZE allocations, but the
> > @@ -306,25 +386,41 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > niov = &owner->area.niovs[i];
> > niov->type = NET_IOV_DMABUF;
> > niov->owner = &owner->area;
> > + atomic_set(&niov->uref, 0);
> > page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
> > net_devmem_get_dma_addr(niov));
> > - if (direction == DMA_TO_DEVICE)
> > - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> > + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> > }
> >
> > virtual += len;
> > }
> >
> > + mutex_lock(&devmem_ar_lock);
> > +
> > + err = __net_devmem_dmabuf_binding_set_mode(direction, autorelease);
> > + if (err < 0) {
> > + NL_SET_ERR_MSG_FMT(extack,
> > + "System already configured with autorelease=%d",
> > + static_key_enabled(&tcp_devmem_ar_key));
> > + goto err_unlock_mutex;
> > + }
> > +
>
> Unless I've misread something, this looks very incorrect. TX bindings
> will accidentally set the system to autorelease=false mode, no? You
> need to make sure you only set the mode in rx-bindings only, right?
Maybe the context you missed is that
__net_devmem_dmabuf_binding_set_mode() no-ops for TX (DMA_TO_DEVICE)?
>
> > err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
> > binding, xa_limit_32b, &id_alloc_next,
> > GFP_KERNEL);
> > if (err < 0)
> > - goto err_free_chunks;
> > + goto err_unset_mode;
> > +
> > + mutex_unlock(&devmem_ar_lock);
> >
> > list_add(&binding->list, &priv->bindings);
> >
> > return binding;
> >
> > +err_unset_mode:
> > + __net_devmem_dmabuf_binding_unset_mode(direction);
> > +err_unlock_mutex:
> > + mutex_unlock(&devmem_ar_lock);
> > err_free_chunks:
> > gen_pool_for_each_chunk(binding->chunk_pool,
> > net_devmem_dmabuf_free_chunk_owner, NULL);
> > diff --git a/net/core/devmem.h b/net/core/devmem.h
> > index 94874b323520..284f0ad5f381 100644
> > --- a/net/core/devmem.h
> > +++ b/net/core/devmem.h
> > @@ -12,9 +12,13 @@
> >
> > #include <net/netmem.h>
> > #include <net/netdev_netlink.h>
> > +#include <linux/jump_label.h>
> >
> > struct netlink_ext_ack;
> >
> > +/* static key for TCP devmem autorelease */
> > +extern struct static_key_false tcp_devmem_ar_key;
> > +
> > struct net_devmem_dmabuf_binding {
> > struct dma_buf *dmabuf;
> > struct dma_buf_attachment *attachment;
> > @@ -43,6 +47,12 @@ struct net_devmem_dmabuf_binding {
> > */
> > struct percpu_ref ref;
> >
> > + /* Counts sockets and rxqs that are using the binding. When this
> > + * reaches zero, all urefs are drained and new sockets cannot join the
> > + * binding.
> > + */
> > + atomic_t users;
> > +
> > /* The list of bindings currently active. Used for netlink to notify us
> > * of the user dropping the bind.
> > */
> > @@ -61,7 +71,7 @@ struct net_devmem_dmabuf_binding {
> >
> > /* Array of net_iov pointers for this binding, sorted by virtual
> > * address. This array is convenient to map the virtual addresses to
> > - * net_iovs in the TX path.
> > + * net_iovs.
> > */
> > struct net_iov **vec;
> >
> > @@ -88,7 +98,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > struct device *dma_dev,
> > enum dma_data_direction direction,
> > unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> > - struct netlink_ext_ack *extack);
> > + struct netlink_ext_ack *extack, bool autorelease);
> > struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id);
> > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
> > int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> > @@ -134,6 +144,26 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
> > percpu_ref_put(&binding->ref);
> > }
> >
> > +void net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding);
> > +
> > +static inline bool
> > +net_devmem_dmabuf_binding_user_get(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + return atomic_inc_not_zero(&binding->users);
> > +}
> > +
> > +static inline void
> > +net_devmem_dmabuf_binding_user_put(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + if (atomic_dec_and_test(&binding->users))
> > + net_devmem_dmabuf_binding_put_urefs(binding);
> > +}
> > +
> > +static inline bool net_devmem_autorelease_enabled(void)
> > +{
> > + return static_branch_unlikely(&tcp_devmem_ar_key);
> > +}
> > +
> > void net_devmem_get_net_iov(struct net_iov *niov);
> > void net_devmem_put_net_iov(struct net_iov *niov);
> >
> > @@ -151,11 +181,38 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr,
> > #else
> > struct net_devmem_dmabuf_binding;
> >
> > +static inline bool
> > +net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + return false;
> > +}
> > +
> > static inline void
> > net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
> > {
> > }
> >
> > +static inline void
> > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> > +{
> > +}
> > +
> > +static inline bool
> > +net_devmem_dmabuf_binding_user_get(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void
> > +net_devmem_dmabuf_binding_user_put(struct net_devmem_dmabuf_binding *binding)
> > +{
> > +}
> > +
> > +static inline bool net_devmem_autorelease_enabled(void)
> > +{
> > + return false;
> > +}
> > +
> > static inline void net_devmem_get_net_iov(struct net_iov *niov)
> > {
> > }
> > @@ -170,7 +227,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > enum dma_data_direction direction,
> > unsigned int dmabuf_fd,
> > struct netdev_nl_sock *priv,
> > - struct netlink_ext_ack *extack)
> > + struct netlink_ext_ack *extack,
> > + bool autorelease)
> > {
> > return ERR_PTR(-EOPNOTSUPP);
> > }
> > diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> > index ba673e81716f..01b7765e11ec 100644
> > --- a/net/core/netdev-genl-gen.c
> > +++ b/net/core/netdev-genl-gen.c
> > @@ -86,10 +86,11 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
> > };
> >
> > /* NETDEV_CMD_BIND_RX - do */
> > -static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = {
> > +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_AUTORELEASE + 1] = {
> > [NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
> > [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
> > [NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy),
> > + [NETDEV_A_DMABUF_AUTORELEASE] = { .type = NLA_U8, },
>
> This patch in the series is proving complicated to review. Any changes
> you can refactor out of it for ease of review would be very welcome.
> Things like the variable renames, netlink api changes, and any
> refactors can be pulled into their own patches would ease review of
> the core functionality.
>
> > };
> >
> > /* NETDEV_CMD_NAPI_SET - do */
> > @@ -188,7 +189,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
> > .cmd = NETDEV_CMD_BIND_RX,
> > .doit = netdev_nl_bind_rx_doit,
> > .policy = netdev_bind_rx_nl_policy,
> > - .maxattr = NETDEV_A_DMABUF_FD,
> > + .maxattr = NETDEV_A_DMABUF_AUTORELEASE,
> > .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> > },
> > {
> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> > index 470fabbeacd9..c742bb34865e 100644
> > --- a/net/core/netdev-genl.c
> > +++ b/net/core/netdev-genl.c
> > @@ -939,6 +939,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> > struct netdev_nl_sock *priv;
> > struct net_device *netdev;
> > unsigned long *rxq_bitmap;
> > + bool autorelease = false;
> > struct device *dma_dev;
> > struct sk_buff *rsp;
> > int err = 0;
> > @@ -952,6 +953,10 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> > ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
> > dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]);
> >
> > + if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE])
> > + autorelease =
> > + !!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]);
> > +
> > priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
> > if (IS_ERR(priv))
> > return PTR_ERR(priv);
> > @@ -1002,7 +1007,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> > }
> >
> > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE,
> > - dmabuf_fd, priv, info->extack);
> > + dmabuf_fd, priv, info->extack,
> > + autorelease);
> > if (IS_ERR(binding)) {
> > err = PTR_ERR(binding);
> > goto err_rxq_bitmap;
> > @@ -1097,7 +1103,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
> >
> > dma_dev = netdev_queue_get_dma_dev(netdev, 0);
> > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE,
> > - dmabuf_fd, priv, info->extack);
> > + dmabuf_fd, priv, info->extack, false);
> > if (IS_ERR(binding)) {
> > err = PTR_ERR(binding);
> > goto err_unlock_netdev;
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index f6526f43aa6e..6355c2ccfb8a 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -87,6 +87,7 @@
> >
> > #include <linux/unaligned.h>
> > #include <linux/capability.h>
> > +#include <linux/dma-buf.h>
> > #include <linux/errno.h>
> > #include <linux/errqueue.h>
> > #include <linux/types.h>
> > @@ -151,6 +152,7 @@
> > #include <uapi/linux/pidfd.h>
> >
> > #include "dev.h"
> > +#include "devmem.h"
> >
> > static DEFINE_MUTEX(proto_list_mutex);
> > static LIST_HEAD(proto_list);
> > @@ -1081,6 +1083,44 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> > #define MAX_DONTNEED_TOKENS 128
> > #define MAX_DONTNEED_FRAGS 1024
> >
> > +static noinline_for_stack int
> > +sock_devmem_dontneed_manual_release(struct sock *sk,
> > + struct dmabuf_token *tokens,
> > + unsigned int num_tokens)
> > +{
> > + struct net_iov *niov;
> > + unsigned int i, j;
> > + netmem_ref netmem;
> > + unsigned int token;
> > + int num_frags = 0;
> > + int ret = 0;
> > +
> > + if (!sk->sk_devmem_info.binding)
> > + return -EINVAL;
> > +
> > + for (i = 0; i < num_tokens; i++) {
> > + for (j = 0; j < tokens[i].token_count; j++) {
> > + size_t size = sk->sk_devmem_info.binding->dmabuf->size;
> > +
> > + token = tokens[i].token_start + j;
> > + if (token >= size / PAGE_SIZE)
> > + break;
> > +
> > + if (++num_frags > MAX_DONTNEED_FRAGS)
> > + return ret;
> > +
> > + niov = sk->sk_devmem_info.binding->vec[token];
> > + if (atomic_dec_and_test(&niov->uref)) {
> > + netmem = net_iov_to_netmem(niov);
> > + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> > + }
> > + ret++;
> > + }
> > + }
> > +
> > + return ret;
> > +}
> > +
> > static noinline_for_stack int
> > sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> > unsigned int num_tokens)
> > @@ -1089,32 +1129,33 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> > int ret = 0, num_frags = 0;
> > netmem_ref netmems[16];
> >
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> > for (i = 0; i < num_tokens; i++) {
> > for (j = 0; j < tokens[i].token_count; j++) {
> > if (++num_frags > MAX_DONTNEED_FRAGS)
> > goto frag_limit_reached;
> >
> > netmem_ref netmem = (__force netmem_ref)__xa_erase(
> > - &sk->sk_user_frags, tokens[i].token_start + j);
> > + &sk->sk_devmem_info.frags,
> > + tokens[i].token_start + j);
> >
> > if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> > continue;
> >
> > netmems[netmem_num++] = netmem;
> > if (netmem_num == ARRAY_SIZE(netmems)) {
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > for (k = 0; k < netmem_num; k++)
> > WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> > netmem_num = 0;
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> > }
> > ret++;
> > }
> > }
> >
> > frag_limit_reached:
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > for (k = 0; k < netmem_num; k++)
> > WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> >
> > @@ -1145,7 +1186,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > return -EFAULT;
> > }
> >
> > - ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
> > + if (net_devmem_autorelease_enabled())
> > + ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
> > + else
> > + ret = sock_devmem_dontneed_manual_release(sk, tokens,
> > + num_tokens);
> >
> > kvfree(tokens);
> > return ret;
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index d5319ebe2452..73a577bd8765 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -260,6 +260,7 @@
> > #include <linux/memblock.h>
> > #include <linux/highmem.h>
> > #include <linux/cache.h>
> > +#include <linux/dma-buf.h>
> > #include <linux/err.h>
> > #include <linux/time.h>
> > #include <linux/slab.h>
> > @@ -492,7 +493,8 @@ void tcp_init_sock(struct sock *sk)
> >
> > set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
> > sk_sockets_allocated_inc(sk);
> > - xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1);
> > + xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
> > + sk->sk_devmem_info.binding = NULL;
> > }
> > EXPORT_IPV6_MOD(tcp_init_sock);
> >
> > @@ -2424,11 +2426,12 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
> >
> > /* Commit part that has been copied to user space. */
> > for (i = 0; i < p->idx; i++)
> > - __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
> > - (__force void *)p->netmems[i], GFP_KERNEL);
> > + __xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i],
> > + XA_ZERO_ENTRY, (__force void *)p->netmems[i],
> > + GFP_KERNEL);
> > /* Rollback what has been pre-allocated and is no longer needed. */
> > for (; i < p->max; i++)
> > - __xa_erase(&sk->sk_user_frags, p->tokens[i]);
> > + __xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]);
> >
> > p->max = 0;
> > p->idx = 0;
> > @@ -2436,14 +2439,17 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
> >
> > static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p)
> > {
> > + if (!net_devmem_autorelease_enabled())
> > + return;
> > +
> > if (!p->max)
> > return;
> >
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> >
> > tcp_xa_pool_commit_locked(sk, p);
> >
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > }
> >
> > static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> > @@ -2454,24 +2460,41 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> > if (p->idx < p->max)
> > return 0;
> >
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> >
> > tcp_xa_pool_commit_locked(sk, p);
> >
> > for (k = 0; k < max_frags; k++) {
> > - err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
> > + err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k],
> > XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
> > if (err)
> > break;
> > }
> >
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> >
> > p->max = k;
> > p->idx = 0;
> > return k ? 0 : err;
> > }
> >
> > +static void tcp_xa_pool_inc_pp_ref_count(struct tcp_xa_pool *tcp_xa_pool,
> > + skb_frag_t *frag)
> > +{
> > + struct net_iov *niov;
> > +
> > + niov = skb_frag_net_iov(frag);
> > +
> > + if (net_devmem_autorelease_enabled()) {
> > + atomic_long_inc(&niov->desc.pp_ref_count);
> > + tcp_xa_pool->netmems[tcp_xa_pool->idx++] =
> > + skb_frag_netmem(frag);
> > + } else {
> > + if (atomic_inc_return(&niov->uref) == 1)
> > + atomic_long_inc(&niov->desc.pp_ref_count);
> > + }
> > +}
> > +
> > /* On error, returns the -errno. On success, returns number of bytes sent to the
> > * user. May not consume all of @remaining_len.
> > */
> > @@ -2533,6 +2556,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> > * sequence of cmsg
> > */
> > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> > + struct net_devmem_dmabuf_binding *binding = NULL;
> > skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> > struct net_iov *niov;
> > u64 frag_offset;
> > @@ -2568,13 +2592,45 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> > start;
> > dmabuf_cmsg.frag_offset = frag_offset;
> > dmabuf_cmsg.frag_size = copy;
> > - err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
> > - skb_shinfo(skb)->nr_frags - i);
> > - if (err)
> > - goto out;
> > +
> > + binding = net_devmem_iov_binding(niov);
> > +
> > + if (net_devmem_autorelease_enabled()) {
> > + err = tcp_xa_pool_refill(sk,
> > + &tcp_xa_pool,
> > + skb_shinfo(skb)->nr_frags - i);
> > + if (err)
> > + goto out;
> > +
> > + dmabuf_cmsg.frag_token =
> > + tcp_xa_pool.tokens[tcp_xa_pool.idx];
> > + } else {
> > + if (!sk->sk_devmem_info.binding) {
> > + if (!net_devmem_dmabuf_binding_user_get(binding)) {
> > + err = -ENODEV;
> > + goto out;
> > + }
> > +
> > + if (!net_devmem_dmabuf_binding_get(binding)) {
> > + net_devmem_dmabuf_binding_user_put(binding);
> > + err = -ENODEV;
> > + goto out;
> > + }
> > +
> > + sk->sk_devmem_info.binding = binding;
> > + }
> > +
> > + if (sk->sk_devmem_info.binding != binding) {
> > + err = -EFAULT;
> > + goto out;
> > + }
> > +
> > + dmabuf_cmsg.frag_token =
> > + net_iov_virtual_addr(niov) >> PAGE_SHIFT;
> > + }
> > +
> >
> > /* Will perform the exchange later */
> > - dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
> > dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
> >
> > offset += copy;
> > @@ -2587,8 +2643,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> > if (err)
> > goto out;
> >
> > - atomic_long_inc(&niov->desc.pp_ref_count);
> > - tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
> > + tcp_xa_pool_inc_pp_ref_count(&tcp_xa_pool, frag);
> >
> > sent += copy;
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index f8a9596e8f4d..420e8c8ebf6d 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -89,6 +89,9 @@
> >
> > #include <crypto/md5.h>
> >
> > +#include <linux/dma-buf.h>
> > +#include "../core/devmem.h"
> > +
> > #include <trace/events/tcp.h>
> >
> > #ifdef CONFIG_TCP_MD5SIG
> > @@ -2492,7 +2495,7 @@ static void tcp_release_user_frags(struct sock *sk)
> > unsigned long index;
> > void *netmem;
> >
> > - xa_for_each(&sk->sk_user_frags, index, netmem)
> > + xa_for_each(&sk->sk_devmem_info.frags, index, netmem)
> > WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
> > #endif
> > }
> > @@ -2503,7 +2506,15 @@ void tcp_v4_destroy_sock(struct sock *sk)
> >
> > tcp_release_user_frags(sk);
> >
> > - xa_destroy(&sk->sk_user_frags);
> > + if (!net_devmem_autorelease_enabled() && sk->sk_devmem_info.binding) {
> > + net_devmem_dmabuf_binding_user_put(sk->sk_devmem_info.binding);
> > + net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding);
> > + sk->sk_devmem_info.binding = NULL;
> > + WARN_ONCE(!xa_empty(&sk->sk_devmem_info.frags),
> > + "non-empty xarray discovered in autorelease off mode");
> > + }
> > +
> > + xa_destroy(&sk->sk_devmem_info.frags);
> >
> > trace_tcp_destroy_sock(sk);
> >
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index bd5462154f97..2aec977f5c12 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -662,7 +662,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
> >
> > __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
> >
> > - xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
> > + xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
> > + newsk->sk_devmem_info.binding = NULL;
> >
> > return newsk;
> > }
> > diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> > index e0b579a1df4f..1e5c209cb998 100644
> > --- a/tools/include/uapi/linux/netdev.h
> > +++ b/tools/include/uapi/linux/netdev.h
> > @@ -207,6 +207,7 @@ enum {
> > NETDEV_A_DMABUF_QUEUES,
> > NETDEV_A_DMABUF_FD,
> > NETDEV_A_DMABUF_ID,
> > + NETDEV_A_DMABUF_AUTORELEASE,
> >
> > __NETDEV_A_DMABUF_MAX,
> > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> >
> > --
> > 2.47.3
> >
>
>
> --
> Thanks,
> Mina
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-22 4:21 ` Mina Almasry
@ 2026-01-26 18:45 ` Bobby Eshleman
2026-01-27 1:31 ` Jakub Kicinski
0 siblings, 1 reply; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-26 18:45 UTC (permalink / raw)
To: Mina Almasry
Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Wed, Jan 21, 2026 at 08:21:36PM -0800, Mina Almasry wrote:
> On Tue, Jan 20, 2026 at 5:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Thu, 15 Jan 2026 21:02:11 -0800 Bobby Eshleman wrote:
> > > This series improves the CPU cost of RX token management by adding an
> > > attribute to NETDEV_CMD_BIND_RX that configures sockets using the
> > > binding to avoid the xarray allocator and instead use a per-binding niov
> > > array and a uref field in niov.
> > >
> > > Improvement is ~13% cpu util per RX user thread.
> > >
> > > Using kperf, the following results were observed:
> > >
> > > Before:
> > > Average RX worker idle %: 13.13, flows 4, test runs 11
> > > After:
> > > Average RX worker idle %: 26.32, flows 4, test runs 11
> > >
> > > Two other approaches were tested, but with no improvement. Namely, 1)
> > > using a hashmap for tokens and 2) keeping an xarray of atomic counters
> > > but using RCU so that the hotpath could be mostly lockless. Neither of
> > > these approaches proved better than the simple array in terms of CPU.
> > >
> > > The attribute NETDEV_A_DMABUF_AUTORELEASE is added to toggle the
> > > optimization. It is an optional attribute and defaults to 0 (i.e.,
> > > optimization on).
> >
> > IDK if the cmsg approach is still right for this flow TBH.
> > IIRC when Stan talked about this a while back we were considering doing
> > this via Netlink. Anything that proves that the user owns the binding
> > would work. IIUC the TCP socket in this design just proves that socket
> > has received a token from a given binding right?
>
> Doesn't 'doing this via netlink' imply it's a control path operation
> that acquires rtnl_lock or netdev_lock or some heavy lock expecting
> you to do some config change? Returning tokens is a data-path
> operation, IIRC we don't even lock the socket to do it in the
> setsockopt.
>
> Is there precedent/path to doing fast data-path operations via netlink?
> There may be value in not biting more than we can chew in one series.
> Maybe an alternative non-setsockopt dontneeding scheme should be its
> own patch series.
>
I'm onboard with improving what we have since it helps all of us
currently using this API, though I'm not opposed to discussing a
redesign in another thread/RFC. I do see the attraction to locating the
core logic in one place and possibly reducing some complexity around
socket/binding relationships.
FWIW regarding nl, I do see it supports rtnl lock-free operations via
'62256f98f244 rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED' and routing was
recently made lockless with that. I don't know of any fast-path
precedent, though. There are also some things I'm not sure about
performance-wise, like hitting the skb allocator an additional time on
every release batch. I'd want to do some minimal latency comparisons
between that path and the sockopt before diving in head-first.
Best,
Bobby
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-22 4:07 ` Stanislav Fomichev
@ 2026-01-27 1:26 ` Jakub Kicinski
2026-01-27 2:30 ` Bobby Eshleman
0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-27 1:26 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Bobby Eshleman, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Wed, 21 Jan 2026 20:07:11 -0800 Stanislav Fomichev wrote:
> On 01/21, Jakub Kicinski wrote:
> > On Wed, 21 Jan 2026 19:25:27 -0800 Bobby Eshleman wrote:
> > > Good point. The only real use case for autorelease=on is for backwards
> > > compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
> > > or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?
> >
> > Hm. Maybe let's return to naming once we have consensus on the uAPI.
> >
> > Does everyone think that pushing this via TCP socket opts still makes
> > sense, even tho in practice the TCP socket is just how we find the
> > binding?
>
> I'm not a fan of the existing cmsg scheme, but we already have userspace
> using it, so getting more performance out of it seems like an easy win?
I don't like:
- the fact that we have to add the binding to a socket (extra field)
- single socket can only serve single binding, there's no technical
reason for this really, AFAICT, just the fact that we have a single
pointer in the sock struct
- the 7 levels of indentation in tcp_recvmsg_dmabuf()
I understand your argument, but as is this series feels closer to a PoC
than an easy win (the easy part should imply minor changes and no
detrimental effect on code quality IMHO).
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-26 18:45 ` Bobby Eshleman
@ 2026-01-27 1:31 ` Jakub Kicinski
2026-01-27 6:00 ` Stanislav Fomichev
0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-27 1:31 UTC (permalink / raw)
To: Bobby Eshleman
Cc: Mina Almasry, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev,
linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Mon, 26 Jan 2026 10:45:22 -0800 Bobby Eshleman wrote:
> I'm onboard with improving what we have since it helps all of us
> currently using this API, though I'm not opposed to discussing a
> redesign in another thread/RFC. I do see the attraction to locating the
> core logic in one place and possibly reducing some complexity around
> socket/binding relationships.
>
> FWIW regarding nl, I do see it supports rtnl lock-free operations via
> '62256f98f244 rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED' and routing was
> recently made lockless with that. I don't know of any fast-path
> precedent, though. There are also some things I'm not sure about
> performance-wise, like hitting the skb allocator an additional time on
> every release batch. I'd want to do some minimal latency comparisons
> between that path and the sockopt before diving in head-first.
FTR I'm not really pushing Netlink specifically; it may work, it may
not. Perhaps some other ioctl-y thing exists. Just in general,
setsockopt() on a specific socket feels increasingly awkward for
buffer flow. Maybe y'all disagree.
I thought I'd clarify since I may be seen as "Mr Netlink Everywhere" :)
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-27 1:26 ` Jakub Kicinski
@ 2026-01-27 2:30 ` Bobby Eshleman
2026-01-27 2:44 ` Jakub Kicinski
0 siblings, 1 reply; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-27 2:30 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Stanislav Fomichev, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Mon, Jan 26, 2026 at 05:26:46PM -0800, Jakub Kicinski wrote:
> On Wed, 21 Jan 2026 20:07:11 -0800 Stanislav Fomichev wrote:
> > On 01/21, Jakub Kicinski wrote:
> > > On Wed, 21 Jan 2026 19:25:27 -0800 Bobby Eshleman wrote:
> > > > Good point. The only real use case for autorelease=on is for backwards
> > > > compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
> > > > or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?
> > >
> > > Hm. Maybe let's return to naming once we have consensus on the uAPI.
> > >
> > > Does everyone think that pushing this via TCP socket opts still makes
> > > sense, even tho in practice the TCP socket is just how we find the
> > > binding?
> >
> > I'm not a fan of the existing cmsg scheme, but we already have userspace
> > using it, so getting more performance out of it seems like an easy win?
>
> I don't like:
> - the fact that we have to add the binding to a socket (extra field)
> - single socket can only serve single binding, there's no technical
> reason for this really, AFAICT, just the fact that we have a single
> pointer in the sock struct
The core reason is that sockets lose the ability to map a given token to
a given binding by no longer storing the niov ptr.
One proposal I had was to encode some number of bits in the token that
can be used to lookup the binding in an array, I could reboot that
approach.
With 32 bits, we can represent:
dmabuf max size = 512 GB, max dmabuf count = 8
dmabuf max size = 256 GB, max dmabuf count = 16
dmabuf max size = 128 GB, max dmabuf count = 32
etc...
Then, if the dmabuf count encoding space is exhausted, the socket would
have to wait until the user returns all of the tokens from one of the
dmabufs and frees the ID (or err out is another option).
This wouldn't change adding a field to the socket, we'd have to add one
or two more for allocating the dmabuf ID and fetching the dmabuf with
it. But it does fix the single binding thing.
> - the 7 levels of indentation in tcp_recvmsg_dmabuf()
For sure, it is getting hairy.
>
> I understand your argument, but as is this series feels closer to a PoC
> than an easy win (the easy part should imply minor changes and no
> detrimental effect on code quality IMHO).
Sure, let's try to find a way to minimize the changes.
Best,
Bobby
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-27 2:30 ` Bobby Eshleman
@ 2026-01-27 2:44 ` Jakub Kicinski
2026-01-27 3:06 ` Bobby Eshleman
0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-27 2:44 UTC (permalink / raw)
To: Bobby Eshleman
Cc: Stanislav Fomichev, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Mon, 26 Jan 2026 18:30:45 -0800 Bobby Eshleman wrote:
> > > I'm not a fan of the existing cmsg scheme, but we already have userspace
> > > using it, so getting more performance out of it seems like an easy win?
> >
> > I don't like:
> > - the fact that we have to add the binding to a socket (extra field)
> > - single socket can only serve single binding, there's no technical
> > reason for this really, AFAICT, just the fact that we have a single
> > pointer in the sock struct
>
> The core reason is that sockets lose the ability to map a given token to
> a given binding by no longer storing the niov ptr.
>
> One proposal I had was to encode some number of bits in the token that
> can be used to lookup the binding in an array, I could reboot that
> approach.
>
> With 32 bits, we can represent:
>
> dmabuf max size = 512 GB, max dmabuf count = 8
> dmabuf max size = 256 GB, max dmabuf count = 16
> dmabuf max size = 128 GB, max dmabuf count = 32
>
> etc...
>
> Then, if the dmabuf count encoding space is exhausted, the socket would
> have to wait until the user returns all of the tokens from one of the
> dmabufs and frees the ID (or err out is another option).
>
> This wouldn't change adding a field to the socket, we'd have to add one
> or two more for allocating the dmabuf ID and fetching the dmabuf with
> it. But it does fix the single binding thing.
I think the bigger problem (than space exhaustion) is that we'd also
need some understanding of permissions. If an application guesses
the binding ID of another app it can mess up its buffers. ENOBUENO..
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-27 2:44 ` Jakub Kicinski
@ 2026-01-27 3:06 ` Bobby Eshleman
2026-01-27 3:43 ` Jakub Kicinski
0 siblings, 1 reply; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-27 3:06 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Stanislav Fomichev, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Mon, Jan 26, 2026 at 06:44:40PM -0800, Jakub Kicinski wrote:
> On Mon, 26 Jan 2026 18:30:45 -0800 Bobby Eshleman wrote:
> > > > I'm not a fan of the existing cmsg scheme, but we already have userspace
> > > > using it, so getting more performance out of it seems like an easy win?
> > >
> > > I don't like:
> > > - the fact that we have to add the binding to a socket (extra field)
> > > - single socket can only serve single binding, there's no technical
> > > reason for this really, AFAICT, just the fact that we have a single
> > > pointer in the sock struct
> >
> > The core reason is that sockets lose the ability to map a given token to
> > a given binding by no longer storing the niov ptr.
> >
> > One proposal I had was to encode some number of bits in the token that
> > can be used to lookup the binding in an array, I could reboot that
> > approach.
> >
> > With 32 bits, we can represent:
> >
> > dmabuf max size = 512 GB, max dmabuf count = 8
> > dmabuf max size = 256 GB, max dmabuf count = 16
> > dmabuf max size = 128 GB, max dmabuf count = 32
> >
> > etc...
> >
> > Then, if the dmabuf count encoding space is exhausted, the socket would
> > have to wait until the user returns all of the tokens from one of the
> > dmabufs and frees the ID (or err out is another option).
> >
> > This wouldn't change adding a field to the socket, we'd have to add one
> > or two more for allocating the dmabuf ID and fetching the dmabuf with
> > it. But it does fix the single binding thing.
>
> I think the bigger problem (than space exhaustion) is that we'd also
> need some understanding of permissions. If an application guesses
> the binding ID of another app it can mess up its buffers. ENOBUENO..
I was thinking it would be per-socket, effectively:
sk->sk_devmem_info.bindings[binding_id_from_token(token)]
So sockets could only access those that they have already recv'd on.
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-27 3:06 ` Bobby Eshleman
@ 2026-01-27 3:43 ` Jakub Kicinski
2026-01-27 3:50 ` Bobby Eshleman
0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2026-01-27 3:43 UTC (permalink / raw)
To: Bobby Eshleman
Cc: Stanislav Fomichev, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Mon, 26 Jan 2026 19:06:49 -0800 Bobby Eshleman wrote:
> > > Then, if the dmabuf count encoding space is exhausted, the socket would
> > > have to wait until the user returns all of the tokens from one of the
> > > dmabufs and frees the ID (or err out is another option).
> > >
> > > This wouldn't change adding a field to the socket, we'd have to add one
> > > or two more for allocating the dmabuf ID and fetching the dmabuf with
> > > it. But it does fix the single binding thing.
> >
> > I think the bigger problem (than space exhaustion) is that we'd also
> > have some understanding of permissions. If an application guesses
> > the binding ID of another app it can mess up its buffers. ENOBUENO..
>
> I was thinking it would be per-socket, effectively:
>
> sk->sk_devmem_info.bindings[binding_id_from_token(token)]
>
> So sockets could only access those that they have already recv'd on.
Ah, missed that the array would be per socket. I guess it'd have to be
reusing the token xarray otherwise we're taking up even more space in
the socket struct? Dunno.
* Re: [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
2026-01-27 3:43 ` Jakub Kicinski
@ 2026-01-27 3:50 ` Bobby Eshleman
0 siblings, 0 replies; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-27 3:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Stanislav Fomichev, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Mon, Jan 26, 2026 at 07:43:59PM -0800, Jakub Kicinski wrote:
> On Mon, 26 Jan 2026 19:06:49 -0800 Bobby Eshleman wrote:
> > > > Then, if the dmabuf count encoding space is exhausted, the socket would
> > > > have to wait until the user returns all of the tokens from one of the
> > > > dmabufs and frees the ID (or err out is another option).
> > > >
> > > > This wouldn't change adding a field to the socket, we'd have to add one
> > > > or two more for allocating the dmabuf ID and fetching the dmabuf with
> > > > it. But it does fix the single binding thing.
> > >
> > > I think the bigger problem (than space exhaustion) is that we'd also
> > > need some understanding of permissions. If an application guesses
> > > the binding ID of another app it can mess up its buffers. ENOBUENO..
> >
> > I was thinking it would be per-socket, effectively:
> >
> > sk->sk_devmem_info.bindings[binding_id_from_token(token)]
> >
> > So sockets could only access those that they have already recv'd on.
>
> Ah, missed that the array would be per socket. I guess it'd have to be
> reusing the token xarray otherwise we're taking up even more space in
> the socket struct? Dunno.
Yeah, unless we just want to break this all off into a malloc'd struct
we point to... or put into tcp_sock (not sure if either addresses the
unappealing bit of adding to struct sock)?
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-27 1:31 ` Jakub Kicinski
@ 2026-01-27 6:00 ` Stanislav Fomichev
2026-01-27 6:48 ` Bobby Eshleman
0 siblings, 1 reply; 34+ messages in thread
From: Stanislav Fomichev @ 2026-01-27 6:00 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Bobby Eshleman, Mina Almasry, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On 01/26, Jakub Kicinski wrote:
> On Mon, 26 Jan 2026 10:45:22 -0800 Bobby Eshleman wrote:
> > I'm onboard with improving what we have since it helps all of us
> > currently using this API, though I'm not opposed to discussing a
> > redesign in another thread/RFC. I do see the attraction to locating the
> > core logic in one place and possibly reducing some complexity around
> > socket/binding relationships.
> >
> > FWIW regarding nl, I do see it supports rtnl lock-free operations via
> > '62256f98f244 rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED' and routing was
> > recently made lockless with that. I don't see / know of any fast path
> > precedent. I'm aware there are some things I'm not sure about being
> > relevant performance-wise, like hitting skb alloc an additional time
> > every release batch. I'd want to do some minimal latency comparisons
> > between that path and sockopt before diving head-first.
>
> FTR I'm not really pushing Netlink specifically, it may work it
> may not. Perhaps some other ioctl-y thing exists. Just in general
> setsockopt() on a specific socket feels increasingly awkward for
> buffer flow. Maybe y'all disagree.
>
> I thought I'd clarify since I may be seen as "Mr Netlink Everywhere" :)
From my side, if we do a completely new uapi, my preference would be on
an af_xdp like mapped rings (presumably on a netlink socket?) to completely
avoid the user-kernel copies.
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-27 6:00 ` Stanislav Fomichev
@ 2026-01-27 6:48 ` Bobby Eshleman
2026-01-30 11:13 ` Pavel Begunkov
0 siblings, 1 reply; 34+ messages in thread
From: Bobby Eshleman @ 2026-01-27 6:48 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Jakub Kicinski, Mina Almasry, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
asml.silence, matttbe, skhawaja, Bobby Eshleman
On Mon, Jan 26, 2026 at 10:00 PM Stanislav Fomichev
<stfomichev@gmail.com> wrote:
>
> On 01/26, Jakub Kicinski wrote:
> > On Mon, 26 Jan 2026 10:45:22 -0800 Bobby Eshleman wrote:
> > > I'm onboard with improving what we have since it helps all of us
> > > currently using this API, though I'm not opposed to discussing a
> > > redesign in another thread/RFC. I do see the attraction to locating the
> > > core logic in one place and possibly reducing some complexity around
> > > socket/binding relationships.
> > >
> > > FWIW regarding nl, I do see it supports rtnl lock-free operations via
> > > '62256f98f244 rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED' and routing was
> > > recently made lockless with that. I don't see / know of any fast path
> > > precedent. I'm aware there are some things I'm not sure about being
> > > relevant performance-wise, like hitting skb alloc an additional time
> > > every release batch. I'd want to do some minimal latency comparisons
> > > between that path and sockopt before diving head-first.
> >
> > FTR I'm not really pushing Netlink specifically, it may work it
> > may not. Perhaps some other ioctl-y thing exists. Just in general
> > setsockopt() on a specific socket feels increasingly awkward for
> > buffer flow. Maybe y'all disagree.
> >
> > I thought I'd clarify since I may be seen as "Mr Netlink Everywhere" :)
>
> From my side, if we do a completely new uapi, my preference would be on
> an af_xdp like mapped rings (presumably on a netlink socket?) to completely
> avoid the user-kernel copies.
I second liking that approach. No put_cmsg() or token alloc overhead (both
jump up in my profiling).
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-27 6:48 ` Bobby Eshleman
@ 2026-01-30 11:13 ` Pavel Begunkov
2026-02-05 3:48 ` Jens Axboe
0 siblings, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2026-01-30 11:13 UTC (permalink / raw)
To: Bobby Eshleman, Stanislav Fomichev
Cc: Jakub Kicinski, Mina Almasry, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
matttbe, skhawaja, Bobby Eshleman
On 1/27/26 06:48, Bobby Eshleman wrote:
> On Mon, Jan 26, 2026 at 10:00 PM Stanislav Fomichev
> <stfomichev@gmail.com> wrote:
>>
>> On 01/26, Jakub Kicinski wrote:
>>> On Mon, 26 Jan 2026 10:45:22 -0800 Bobby Eshleman wrote:
>>>> I'm onboard with improving what we have since it helps all of us
>>>> currently using this API, though I'm not opposed to discussing a
>>>> redesign in another thread/RFC. I do see the attraction to locating the
>>>> core logic in one place and possibly reducing some complexity around
>>>> socket/binding relationships.
>>>>
>>>> FWIW regarding nl, I do see it supports rtnl lock-free operations via
>>>> '62256f98f244 rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED' and routing was
>>>> recently made lockless with that. I don't see / know of any fast path
>>>> precedent. I'm aware there are some things I'm not sure about being
>>>> relevant performance-wise, like hitting skb alloc an additional time
>>>> every release batch. I'd want to do some minimal latency comparisons
>>>> between that path and sockopt before diving head-first.
>>>
>>> FTR I'm not really pushing Netlink specifically, it may work it
>>> may not. Perhaps some other ioctl-y thing exists. Just in general
>>> setsockopt() on a specific socket feels increasingly awkward for
>>> buffer flow. Maybe y'all disagree.
>>>
>>> I thought I'd clarify since I may be seen as "Mr Netlink Everywhere" :)
>>
>> From my side, if we do a completely new uapi, my preference would be on
>> an af_xdp like mapped rings (presumably on a netlink socket?) to completely
>> avoid the user-kernel copies.
>
> I second liking that approach. No put_cmsg() or token alloc overhead (both
> jump up in my profiling).
Hmm, makes me wonder why not use zcrx instead of reinventing it? It
doesn't bind net_iov to sockets just as you do in this series. And it
also returns buffers back via a shared ring. Otherwise you'll be facing
same issues, like rings running out of space, and so you will need to
have a fallback path. And user space will need to synchronise the ring
if it's shared with other threads, and there will be a question of how
to scale it next, possibly by creating multiple rings as I'll likely
do soon for zcrx.
--
Pavel Begunkov
* Re: [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management
2026-01-30 11:13 ` Pavel Begunkov
@ 2026-02-05 3:48 ` Jens Axboe
0 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2026-02-05 3:48 UTC (permalink / raw)
To: Pavel Begunkov, Bobby Eshleman, Stanislav Fomichev
Cc: Jakub Kicinski, Mina Almasry, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev,
netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
matttbe, skhawaja, Bobby Eshleman
On 1/30/26 4:13 AM, Pavel Begunkov wrote:
> On 1/27/26 06:48, Bobby Eshleman wrote:
>> On Mon, Jan 26, 2026 at 10:00 PM Stanislav Fomichev
>> <stfomichev@gmail.com> wrote:
>>>
>>> On 01/26, Jakub Kicinski wrote:
>>>> On Mon, 26 Jan 2026 10:45:22 -0800 Bobby Eshleman wrote:
>>>>> I'm onboard with improving what we have since it helps all of us
>>>>> currently using this API, though I'm not opposed to discussing a
>>>>> redesign in another thread/RFC. I do see the attraction to locating the
>>>>> core logic in one place and possibly reducing some complexity around
>>>>> socket/binding relationships.
>>>>>
>>>>> FWIW regarding nl, I do see it supports rtnl lock-free operations via
>>>>> '62256f98f244 rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED' and routing was
>>>>> recently made lockless with that. I don't see / know of any fast path
>>>>> precedent. I'm aware there are some things I'm not sure about being
>>>>> relevant performance-wise, like hitting skb alloc an additional time
>>>>> every release batch. I'd want to do some minimal latency comparisons
>>>>> between that path and sockopt before diving head-first.
>>>>
>>>> FTR I'm not really pushing Netlink specifically, it may work it
>>>> may not. Perhaps some other ioctl-y thing exists. Just in general
>>>> setsockopt() on a specific socket feels increasingly awkward for
>>>> buffer flow. Maybe y'all disagree.
>>>>
>>>> I thought I'd clarify since I may be seen as "Mr Netlink Everywhere" :)
>>>
>>> From my side, if we do a completely new uapi, my preference would be on
>>> an af_xdp like mapped rings (presumably on a netlink socket?) to completely
>>> avoid the user-kernel copies.
>>
>> I second liking that approach. No put_cmsg() or token alloc
>> overhead (both jump up in my profiling).
>
> Hmm, makes me wonder why not use zcrx instead of reinventing it? It
Was thinking the same throughout most of this later discussion... We
already have an API for this.
--
Jens Axboe
Thread overview: 34+ messages
2026-01-16 5:02 [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 1/5] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 2/5] net: devmem: refactor sock_devmem_dontneed for autorelease split Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 3/5] net: devmem: implement autorelease token management Bobby Eshleman
2026-01-21 1:00 ` Jakub Kicinski
2026-01-21 5:33 ` Bobby Eshleman
2026-01-22 4:15 ` Mina Almasry
2026-01-22 5:18 ` Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute Bobby Eshleman
2026-01-21 0:36 ` Jakub Kicinski
2026-01-21 5:44 ` Bobby Eshleman
2026-01-22 1:35 ` Jakub Kicinski
2026-01-22 2:37 ` Bobby Eshleman
2026-01-22 2:50 ` Jakub Kicinski
2026-01-22 3:25 ` Bobby Eshleman
2026-01-22 3:46 ` Jakub Kicinski
2026-01-22 4:07 ` Stanislav Fomichev
2026-01-27 1:26 ` Jakub Kicinski
2026-01-27 2:30 ` Bobby Eshleman
2026-01-27 2:44 ` Jakub Kicinski
2026-01-27 3:06 ` Bobby Eshleman
2026-01-27 3:43 ` Jakub Kicinski
2026-01-27 3:50 ` Bobby Eshleman
2026-01-16 5:02 ` [PATCH net-next v10 5/5] selftests: drv-net: devmem: add autorelease tests Bobby Eshleman
2026-01-21 1:07 ` [PATCH net-next v10 0/5] net: devmem: improve cpu cost of RX token management Jakub Kicinski
2026-01-21 5:29 ` Bobby Eshleman
2026-01-22 1:37 ` Jakub Kicinski
2026-01-22 4:21 ` Mina Almasry
2026-01-26 18:45 ` Bobby Eshleman
2026-01-27 1:31 ` Jakub Kicinski
2026-01-27 6:00 ` Stanislav Fomichev
2026-01-27 6:48 ` Bobby Eshleman
2026-01-30 11:13 ` Pavel Begunkov
2026-02-05 3:48 ` Jens Axboe