* [PATCH net-next v9 0/5] net: devmem: improve cpu cost of RX token management
@ 2026-01-10 2:18 Bobby Eshleman
2026-01-10 2:18 ` [PATCH net-next v9 1/5] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
` (4 more replies)
0 siblings, 5 replies; 21+ messages in thread
From: Bobby Eshleman @ 2026-01-10 2:18 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
Andrew Lunn, Shuah Khan, Donald Hunter
Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman
This series reduces the CPU cost of RX token management by adding an
attribute to NETDEV_CMD_BIND_RX that configures sockets using the
binding to avoid the xarray allocator and instead use a per-binding niov
array and a uref field in niov.
The improvement is ~13% CPU utilization per RX user thread.
Using kperf, the following results were observed:
Before:
Average RX worker idle %: 13.13, flows 4, test runs 11
After:
Average RX worker idle %: 26.32, flows 4, test runs 11
Two other approaches were tested, but with no improvement. Namely, 1)
using a hashmap for tokens and 2) keeping an xarray of atomic counters
but using RCU so that the hotpath could be mostly lockless. Neither of
these approaches proved better than the simple array in terms of CPU.
The attribute NETDEV_A_DMABUF_AUTORELEASE is added to toggle the
optimization. The attribute is optional and defaults to 0 (i.e., the
optimization is enabled).
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Changes in v9:
- fixed build with NET_DEVMEM=n
- fixed bug in rx bindings count logic
- Link to v8: https://lore.kernel.org/r/20260107-scratch-bobbyeshleman-devmem-tcp-token-upstream-v8-0-92c968631496@meta.com
Changes in v8:
- change static branch logic (only set when enabled, otherwise just
always revert back to disabled)
- fix missing tests
- Link to v7: https://lore.kernel.org/r/20251119-scratch-bobbyeshleman-devmem-tcp-token-upstream-v7-0-1abc8467354c@meta.com
Changes in v7:
- use netlink instead of sockopt (Stan)
- restrict system to only one mode, dmabuf bindings cannot coexist
  with different modes (Stan)
- use static branching to enforce single system-wide mode (Stan)
- Link to v6: https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-upstream-v6-0-ea98cf4d40b3@meta.com
Changes in v6:
- renamed 'net: devmem: use niov array for token management' to refer to
optionality of new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding
field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-upstream-v5-0-47cb85f5259e@meta.com
Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v4-0-39156563c3ea@meta.com
Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v3-0-084b46bda88f@meta.com
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
footprint
- fallback to cleaning up references in dmabuf unbind if socket
leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com
Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule
(Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-upstream-v1-0-d946169b5550@meta.com
---
Bobby Eshleman (5):
net: devmem: rename tx_vec to vec in dmabuf binding
net: devmem: refactor sock_devmem_dontneed for autorelease split
net: devmem: implement autorelease token management
net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
selftests: drv-net: devmem: add autorelease test
Documentation/netlink/specs/netdev.yaml | 12 +++
Documentation/networking/devmem.rst | 70 +++++++++++++
include/net/netmem.h | 1 +
include/net/sock.h | 7 +-
include/uapi/linux/netdev.h | 1 +
net/core/devmem.c | 116 ++++++++++++++++++----
net/core/devmem.h | 29 +++++-
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 10 +-
net/core/sock.c | 103 ++++++++++++++-----
net/ipv4/tcp.c | 76 +++++++++++---
net/ipv4/tcp_ipv4.c | 11 +-
net/ipv4/tcp_minisocks.c | 3 +-
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/drivers/net/hw/devmem.py | 21 +++-
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 19 ++--
16 files changed, 407 insertions(+), 78 deletions(-)
---
base-commit: 6ad078fa0ababa8de2a2b39f476d2abd179a3cf6
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503
Best regards,
--
Bobby Eshleman <bobbyeshleman@meta.com>
^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net-next v9 1/5] net: devmem: rename tx_vec to vec in dmabuf binding
  2026-01-10  2:18 [PATCH net-next v9 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
@ 2026-01-10  2:18 ` Bobby Eshleman
  2026-01-10  2:18 ` [PATCH net-next v9 2/5] net: devmem: refactor sock_devmem_dontneed for autorelease split Bobby Eshleman
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Bobby Eshleman @ 2026-01-10  2:18 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, Donald Hunter
  Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
	linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman

From: Bobby Eshleman <bobbyeshleman@meta.com>

Rename the 'tx_vec' field in struct net_devmem_dmabuf_binding to 'vec'.
This field holds pointers to net_iov structures. The rename prepares for
reusing 'vec' for both TX and RX directions.

No functional change intended.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 net/core/devmem.c | 22 +++++++++++-----------
 net/core/devmem.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/net/core/devmem.c b/net/core/devmem.c
index ec4217d6c0b4..05a9a9e7abb9 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -75,7 +75,7 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
 	dma_buf_detach(binding->dmabuf, binding->attachment);
 	dma_buf_put(binding->dmabuf);
 	xa_destroy(&binding->bound_rxqs);
-	kvfree(binding->tx_vec);
+	kvfree(binding->vec);
 	kfree(binding);
 }
 
@@ -232,10 +232,10 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 	}
 
 	if (direction == DMA_TO_DEVICE) {
-		binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
-						 sizeof(struct net_iov *),
-						 GFP_KERNEL);
-		if (!binding->tx_vec) {
+		binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+					      sizeof(struct net_iov *),
+					      GFP_KERNEL);
+		if (!binding->vec) {
 			err = -ENOMEM;
 			goto err_unmap;
 		}
@@ -249,7 +249,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 				     dev_to_node(&dev->dev));
 	if (!binding->chunk_pool) {
 		err = -ENOMEM;
-		goto err_tx_vec;
+		goto err_vec;
 	}
 
 	virtual = 0;
@@ -295,7 +295,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 			page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 						      net_devmem_get_dma_addr(niov));
 			if (direction == DMA_TO_DEVICE)
-				binding->tx_vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+				binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
 		}
 
 		virtual += len;
@@ -315,8 +315,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 	gen_pool_for_each_chunk(binding->chunk_pool,
 				net_devmem_dmabuf_free_chunk_owner, NULL);
 	gen_pool_destroy(binding->chunk_pool);
-err_tx_vec:
-	kvfree(binding->tx_vec);
+err_vec:
+	kvfree(binding->vec);
 err_unmap:
 	dma_buf_unmap_attachment_unlocked(binding->attachment, binding->sgt,
 					  direction);
@@ -363,7 +363,7 @@ struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct
sock *sk,
 	int err = 0;
 
 	binding = net_devmem_lookup_dmabuf(dmabuf_id);
-	if (!binding || !binding->tx_vec) {
+	if (!binding || !binding->vec) {
 		err = -EINVAL;
 		goto out_err;
 	}
@@ -414,7 +414,7 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding,
 	*off = virt_addr % PAGE_SIZE;
 	*size = PAGE_SIZE - *off;
 
-	return binding->tx_vec[virt_addr / PAGE_SIZE];
+	return binding->vec[virt_addr / PAGE_SIZE];
 }
 
 /*** "Dmabuf devmem memory provider" ***/
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 0b43a648cd2e..1ea6228e4f40 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -63,7 +63,7 @@ struct net_devmem_dmabuf_binding {
 	 * address. This array is convenient to map the virtual addresses to
 	 * net_iovs in the TX path.
 	 */
-	struct net_iov **tx_vec;
+	struct net_iov **vec;
 
 	struct work_struct unbind_w;
 };
-- 
2.47.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [PATCH net-next v9 2/5] net: devmem: refactor sock_devmem_dontneed for autorelease split
  2026-01-10  2:18 [PATCH net-next v9 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
  2026-01-10  2:18 ` [PATCH net-next v9 1/5] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
@ 2026-01-10  2:18 ` Bobby Eshleman
  2026-01-10  2:18 ` [PATCH net-next v9 3/5] net: devmem: implement autorelease token management Bobby Eshleman
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Bobby Eshleman @ 2026-01-10  2:18 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, Donald Hunter
  Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
	linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman

From: Bobby Eshleman <bobbyeshleman@meta.com>

Refactor sock_devmem_dontneed() in preparation for supporting both
autorelease and manual token release modes.

Split the function into two parts:

- sock_devmem_dontneed(): handles input validation, token allocation,
  and copying from userspace
- sock_devmem_dontneed_autorelease(): performs the actual token release
  via xarray lookup and page pool put

This separation allows a future commit to add a parallel
sock_devmem_dontneed_manual_release() function that uses a different
token tracking mechanism (per-niov reference counting) without
duplicating the input validation logic.

The refactoring is purely mechanical with no functional change. It is
only intended to minimize the noise in subsequent patches.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 net/core/sock.c | 52 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index a1c8b47b0d56..f6526f43aa6e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1082,30 +1082,13 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 #define MAX_DONTNEED_FRAGS 1024
 
 static noinline_for_stack int
-sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
+sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
+				 unsigned int num_tokens)
 {
-	unsigned int num_tokens, i, j, k, netmem_num = 0;
-	struct dmabuf_token *tokens;
+	unsigned int i, j, k, netmem_num = 0;
 	int ret = 0, num_frags = 0;
 	netmem_ref netmems[16];
 
-	if (!sk_is_tcp(sk))
-		return -EBADF;
-
-	if (optlen % sizeof(*tokens) ||
-	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
-		return -EINVAL;
-
-	num_tokens = optlen / sizeof(*tokens);
-	tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
-	if (!tokens)
-		return -ENOMEM;
-
-	if (copy_from_sockptr(tokens, optval, optlen)) {
-		kvfree(tokens);
-		return -EFAULT;
-	}
-
 	xa_lock_bh(&sk->sk_user_frags);
 	for (i = 0; i < num_tokens; i++) {
 		for (j = 0; j < tokens[i].token_count; j++) {
@@ -1135,6 +1118,35 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
 	for (k = 0; k < netmem_num; k++)
 		WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
 
+	return ret;
+}
+
+static noinline_for_stack int
+sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
+{
+	struct dmabuf_token *tokens;
+	unsigned int num_tokens;
+	int ret;
+
+	if (!sk_is_tcp(sk))
+		return -EBADF;
+
+	if (optlen % sizeof(*tokens) ||
+	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
+		return -EINVAL;
+
+	num_tokens = optlen / sizeof(*tokens);
+	tokens = kvmalloc_array(num_tokens, sizeof(*tokens),
GFP_KERNEL);
+	if (!tokens)
+		return -ENOMEM;
+
+	if (copy_from_sockptr(tokens, optval, optlen)) {
+		kvfree(tokens);
+		return -EFAULT;
+	}
+
+	ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
+	kvfree(tokens);
 	return ret;
 }
-- 
2.47.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [PATCH net-next v9 3/5] net: devmem: implement autorelease token management
  2026-01-10  2:18 [PATCH net-next v9 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman
  2026-01-10  2:18 ` [PATCH net-next v9 1/5] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
  2026-01-10  2:18 ` [PATCH net-next v9 2/5] net: devmem: refactor sock_devmem_dontneed for autorelease split Bobby Eshleman
@ 2026-01-10  2:18 ` Bobby Eshleman
  2026-01-11 19:12   ` Mina Almasry
  2026-01-13  4:00   ` [net-next,v9,3/5] " Jakub Kicinski
  2026-01-10  2:18 ` [PATCH net-next v9 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute Bobby Eshleman
  2026-01-10  2:18 ` [PATCH net-next v9 5/5] selftests: drv-net: devmem: add autorelease test Bobby Eshleman
  4 siblings, 2 replies; 21+ messages in thread
From: Bobby Eshleman @ 2026-01-10  2:18 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, Donald Hunter
  Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc,
	linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add support for autorelease toggling of tokens using a static branch to
control system-wide behavior. This allows applications to choose between
two memory management modes:

1. Autorelease on: Leaked tokens are automatically released when the
   socket closes.
2. Autorelease off: Leaked tokens are released during dmabuf unbind.

The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE
attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per
binding is disallowed and is rejected by netlink. The system will be
"locked" into the mode that the first binding is set to. It can only be
changed again once there are zero bindings on the system.
Disabling autorelease offers ~13% improvement in CPU utilization.

Static branching is used to limit the system to one mode or the other.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v9:
- Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n
- Add wrapper around tcp_devmem_ar_key accesses so that it may be
  stubbed out when NET_DEVMEM=n
- only dec rx binding count for rx bindings in free (v8 did not exclude
  TX bindings)

Changes in v8:
- Only reset static key when bindings go to zero, defaulting back to
  disabled (Stan).
- Fix bad usage of xarray spinlock for sleepy static branch switching,
  use mutex instead.
- Access pp_ref_count via niov->desc instead of niov directly.
- Move reset of static key to __net_devmem_dmabuf_binding_free() so that
  the static key can not be changed while there are outstanding tokens
  (free is only called when reference count reaches zero).
- Add net_devmem_dmabuf_rx_bindings_count because tokens may be active
  even after xa_erase(), so static key changes must wait until all RX
  bindings are finally freed (not just when xarray is empty). A counter
  is a simple way to track this.
- socket takes reference on the binding, to avoid use-after-free on
  sk_devmem_info.binding in the case that user releases all tokens,
  unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token).
- removed some comments that were unnecessary

Changes in v7:
- implement autorelease with static branch (Stan)
- use netlink instead of sockopt (Stan)
- merge uAPI and implementation patches into one patch (seemed less
  confusing)

Changes in v6:
- remove sk_devmem_info.autorelease, using binding->autorelease instead
- move binding->autorelease check to outside of
  net_devmem_dmabuf_binding_put_urefs() (Mina)
- remove overly defensive net_is_devmem_iov() (Mina)
- add comment about multiple urefs mapping to a single netmem ref (Mina)
- remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina)
- use niov without casting back and forth with netmem (Mina)
- move the autorelease flag from per-binding to per-socket (Mina)
- remove the batching logic in sock_devmem_dontneed_manual_release()
  (Mina)
- move autorelease check inside tcp_xa_pool_commit() (Mina)
- remove single-binding restriction for autorelease mode (Mina)
- unbind always checks for leaked urefs

Changes in v5:
- remove unused variables
- introduce autorelease flag, preparing for future patch toggle new
  behavior

Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
  footprint
- fallback to cleaning up references in dmabuf unbind if socket leaked
  tokens
- drop ethtool patch

Changes in v2:
- always use GFP_ZERO for binding->vec (Mina)
- remove WARN for changed binding (Mina)
- remove extraneous binding ref get (Mina)
- remove WARNs on invalid user input (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- fix length of alloc for urefs
---
 Documentation/netlink/specs/netdev.yaml |  12 ++++
 include/net/netmem.h                    |   1 +
 include/net/sock.h                      |   7 ++-
 include/uapi/linux/netdev.h             |   1 +
 net/core/devmem.c                       | 104 ++++++++++++++++++++++----
 net/core/devmem.h                       |  27 ++++++-
 net/core/netdev-genl-gen.c              |   5 +-
 net/core/netdev-genl.c                  |  10 ++-
 net/core/sock.c                         |  57 +++++++++++++--
 net/ipv4/tcp.c                          |  76 ++++++++++++++-----
net/ipv4/tcp_ipv4.c                     |  11 +++-
 net/ipv4/tcp_minisocks.c                |   3 +-
 tools/include/uapi/linux/netdev.h       |   1 +
 13 files changed, 269 insertions(+), 46 deletions(-)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 596c306ce52b..7cbe9e7b9ee5 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -562,6 +562,17 @@ attribute-sets:
         type: u32
         checks:
           min: 1
+      -
+        name: autorelease
+        doc: |
+          Token autorelease mode. If true (1), leaked tokens are automatically
+          released when the socket closes. If false (0), leaked tokens are only
+          released when the dmabuf is unbound. Once a binding is created with a
+          specific mode, all subsequent bindings system-wide must use the same
+          mode.
+
+          Optional. Defaults to false if not specified.
+        type: u8
 
 operations:
   list:
@@ -769,6 +780,7 @@ operations:
             - ifindex
             - fd
             - queues
+            - autorelease
       reply:
         attributes:
           - id
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 9e10f4ac50c3..80d2263ba4ed 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -112,6 +112,7 @@ struct net_iov {
 	};
 	struct net_iov_area *owner;
 	enum net_iov_type type;
+	atomic_t uref;
 };
 
 struct net_iov_area {
diff --git a/include/net/sock.h b/include/net/sock.h
index aafe8bdb2c0f..9d3d5bde15e9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -352,7 +352,7 @@ struct sk_filter;
  *	@sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
  *	@sk_scm_unused: unused flags for scm_recv()
  *	@ns_tracker: tracker for netns reference
- *	@sk_user_frags: xarray of pages the user is holding a reference on.
+ *	@sk_devmem_info: the devmem binding information for the socket
  *	@sk_owner: reference to the real owner of the socket that calls
  *		   sock_lock_init_class_and_name().
 */
@@ -584,7 +584,10 @@ struct sock {
 	struct numa_drop_counters	*sk_drop_counters;
 	struct rcu_head			sk_rcu;
 	netns_tracker			ns_tracker;
-	struct xarray			sk_user_frags;
+	struct {
+		struct xarray		frags;
+		struct net_devmem_dmabuf_binding *binding;
+	} sk_devmem_info;
 
 #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
 	struct module			*sk_owner;
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index e0b579a1df4f..1e5c209cb998 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -207,6 +207,7 @@ enum {
 	NETDEV_A_DMABUF_QUEUES,
 	NETDEV_A_DMABUF_FD,
 	NETDEV_A_DMABUF_ID,
+	NETDEV_A_DMABUF_AUTORELEASE,
 
 	__NETDEV_A_DMABUF_MAX,
 	NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 05a9a9e7abb9..05c16df657c7 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -11,6 +11,7 @@
 #include <linux/genalloc.h>
 #include <linux/mm.h>
 #include <linux/netdevice.h>
+#include <linux/skbuff_ref.h>
 #include <linux/types.h>
 #include <net/netdev_queues.h>
 #include <net/netdev_rx_queue.h>
@@ -28,6 +29,19 @@
 static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
 
+/* If the user unbinds before releasing all tokens, the static key must not
+ * change until all tokens have been released (to avoid calling the wrong
+ * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes
+ * and binding alloc/free atomic with regards to each other, using the
+ * devmem_ar_lock. This works because binding free does not occur until all of
+ * the outstanding token's references on the binding are dropped.
+ */
+static DEFINE_MUTEX(devmem_ar_lock);
+
+DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
+EXPORT_SYMBOL(tcp_devmem_ar_key);
+static int net_devmem_dmabuf_rx_bindings_count;
+
 static const struct memory_provider_ops dmabuf_devmem_ops;
 
 bool net_is_devmem_iov(struct net_iov *niov)
@@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
 	size_t size, avail;
 
+	if (binding->direction == DMA_FROM_DEVICE) {
+		mutex_lock(&devmem_ar_lock);
+		net_devmem_dmabuf_rx_bindings_count--;
+		if (net_devmem_dmabuf_rx_bindings_count == 0)
+			static_branch_disable(&tcp_devmem_ar_key);
+		mutex_unlock(&devmem_ar_lock);
+	}
+
 	gen_pool_for_each_chunk(binding->chunk_pool,
 				net_devmem_dmabuf_free_chunk_owner, NULL);
 
@@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
 	gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
 }
 
+static void
+net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
+{
+	int i;
+
+	for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
+		struct net_iov *niov;
+		netmem_ref netmem;
+
+		niov = binding->vec[i];
+		netmem = net_iov_to_netmem(niov);
+
+		/* Multiple urefs map to only a single netmem ref.
+		 */
+		if (atomic_xchg(&niov->uref, 0) > 0)
+			WARN_ON_ONCE(!napi_pp_put_page(netmem));
+	}
+}
+
 void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 {
 	struct netdev_rx_queue *rxq;
@@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 		__net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
 	}
 
+	net_devmem_dmabuf_binding_put_urefs(binding);
 	net_devmem_dmabuf_binding_put(binding);
 }
 
@@ -179,8 +220,10 @@ struct net_devmem_dmabuf_binding *
 net_devmem_bind_dmabuf(struct net_device *dev,
 		       struct device *dma_dev,
 		       enum dma_data_direction direction,
-		       unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
-		       struct netlink_ext_ack *extack)
+		       unsigned int dmabuf_fd,
+		       struct netdev_nl_sock *priv,
+		       struct netlink_ext_ack *extack,
+		       bool autorelease)
 {
 	struct net_devmem_dmabuf_binding *binding;
 	static u32 id_alloc_next;
@@ -231,14 +274,12 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 		goto err_detach;
 	}
 
-	if (direction == DMA_TO_DEVICE) {
-		binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
-					      sizeof(struct net_iov *),
-					      GFP_KERNEL);
-		if (!binding->vec) {
-			err = -ENOMEM;
-			goto err_unmap;
-		}
+	binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+				      sizeof(struct net_iov *),
+				      GFP_KERNEL | __GFP_ZERO);
+	if (!binding->vec) {
+		err = -ENOMEM;
+		goto err_unmap;
 	}
 
 	/* For simplicity we expect to make PAGE_SIZE allocations, but the
@@ -292,25 +333,62 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 			niov = &owner->area.niovs[i];
 			niov->type = NET_IOV_DMABUF;
 			niov->owner = &owner->area;
+			atomic_set(&niov->uref, 0);
 			page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 						      net_devmem_get_dma_addr(niov));
-			if (direction == DMA_TO_DEVICE)
-				binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+			binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
 		}
 
 		virtual += len;
 	}
 
+	mutex_lock(&devmem_ar_lock);
+
+	if (direction == DMA_FROM_DEVICE) {
+		if (net_devmem_dmabuf_rx_bindings_count > 0) {
+			bool mode;
+
+			mode =
static_key_enabled(&tcp_devmem_ar_key);
+
+			/* When bindings exist, enforce that the mode does not
+			 * change.
+			 */
+			if (mode != autorelease) {
+				NL_SET_ERR_MSG_FMT(extack,
+						   "System already configured with autorelease=%d",
+						   mode);
+				err = -EINVAL;
+				goto err_unlock_mutex;
+			}
+		} else if (autorelease) {
+			/* First binding with autorelease enabled sets the
+			 * mode. If autorelease is false, the key is already
+			 * disabled by default so no action is needed.
+			 */
+			static_branch_enable(&tcp_devmem_ar_key);
+		}
+
+		net_devmem_dmabuf_rx_bindings_count++;
+	}
+
 	err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
 			      binding, xa_limit_32b, &id_alloc_next,
 			      GFP_KERNEL);
 	if (err < 0)
-		goto err_free_chunks;
+		goto err_dec_binding_count;
+
+	mutex_unlock(&devmem_ar_lock);
 
 	list_add(&binding->list, &priv->bindings);
 
 	return binding;
 
+err_dec_binding_count:
+	if (direction == DMA_FROM_DEVICE)
+		net_devmem_dmabuf_rx_bindings_count--;
+
+err_unlock_mutex:
+	mutex_unlock(&devmem_ar_lock);
 err_free_chunks:
 	gen_pool_for_each_chunk(binding->chunk_pool,
 				net_devmem_dmabuf_free_chunk_owner, NULL);
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 1ea6228e4f40..8c586f30e371 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -12,9 +12,13 @@
 #include <net/netmem.h>
 #include <net/netdev_netlink.h>
+#include <linux/jump_label.h>
 
 struct netlink_ext_ack;
 
+/* static key for TCP devmem autorelease */
+extern struct static_key_false tcp_devmem_ar_key;
+
 struct net_devmem_dmabuf_binding {
 	struct dma_buf *dmabuf;
 	struct dma_buf_attachment *attachment;
@@ -61,7 +65,7 @@ struct net_devmem_dmabuf_binding {
 
 	/* Array of net_iov pointers for this binding, sorted by virtual
 	 * address. This array is convenient to map the virtual addresses to
-	 * net_iovs in the TX path.
+	 * net_iovs.
	 */
 	struct net_iov **vec;
 
@@ -88,7 +92,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 		       struct device *dma_dev,
 		       enum dma_data_direction direction,
 		       unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
-		       struct netlink_ext_ack *extack);
+		       struct netlink_ext_ack *extack, bool autorelease);
 struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id);
 void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
 int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
@@ -138,6 +142,11 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
 	schedule_work(&binding->unbind_w);
 }
 
+static inline bool net_devmem_autorelease_enabled(void)
+{
+	return static_branch_unlikely(&tcp_devmem_ar_key);
+}
+
 void net_devmem_get_net_iov(struct net_iov *niov);
 void net_devmem_put_net_iov(struct net_iov *niov);
 
@@ -155,6 +164,12 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr,
 #else
 struct net_devmem_dmabuf_binding;
 
+static inline bool
+net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
+{
+	return false;
+}
+
 static inline void
 net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
 {
@@ -174,7 +189,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 		       enum dma_data_direction direction,
 		       unsigned int dmabuf_fd,
 		       struct netdev_nl_sock *priv,
-		       struct netlink_ext_ack *extack)
+		       struct netlink_ext_ack *extack,
+		       bool autorelease)
 {
 	return ERR_PTR(-EOPNOTSUPP);
 }
@@ -241,6 +257,11 @@ net_devmem_iov_binding(const struct net_iov *niov)
 {
 	return NULL;
 }
+
+static inline bool net_devmem_autorelease_enabled(void)
+{
+	return false;
+}
 #endif
 
 #endif /* _NET_DEVMEM_H */
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index ba673e81716f..01b7765e11ec 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -86,10 +86,11 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
 };
 
 /* NETDEV_CMD_BIND_RX - do */
-static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = {
+static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_AUTORELEASE + 1] = {
 	[NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
 	[NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
 	[NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy),
+	[NETDEV_A_DMABUF_AUTORELEASE] = { .type = NLA_U8, },
 };
 
 /* NETDEV_CMD_NAPI_SET - do */
@@ -188,7 +189,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.cmd = NETDEV_CMD_BIND_RX,
 		.doit = netdev_nl_bind_rx_doit,
 		.policy = netdev_bind_rx_nl_policy,
-		.maxattr = NETDEV_A_DMABUF_FD,
+		.maxattr = NETDEV_A_DMABUF_AUTORELEASE,
 		.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
 	},
 	{
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 470fabbeacd9..c742bb34865e 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -939,6 +939,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 	struct netdev_nl_sock *priv;
 	struct net_device *netdev;
 	unsigned long *rxq_bitmap;
+	bool autorelease = false;
 	struct device *dma_dev;
 	struct sk_buff *rsp;
 	int err = 0;
@@ -952,6 +953,10 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 	ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
 	dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]);
 
+	if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE])
+		autorelease =
+			!!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]);
+
 	priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
 	if (IS_ERR(priv))
 		return PTR_ERR(priv);
@@ -1002,7 +1007,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 	}
 
 	binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE,
-					 dmabuf_fd, priv, info->extack);
+					 dmabuf_fd, priv, info->extack,
+					 autorelease);
 	if (IS_ERR(binding)) {
 		err = PTR_ERR(binding);
 		goto err_rxq_bitmap;
@@ -1097,7 +1103,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info
*info)
 	dma_dev = netdev_queue_get_dma_dev(netdev, 0);
 	binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE,
-					 dmabuf_fd, priv, info->extack);
+					 dmabuf_fd, priv, info->extack, false);
 	if (IS_ERR(binding)) {
 		err = PTR_ERR(binding);
 		goto err_unlock_netdev;
diff --git a/net/core/sock.c b/net/core/sock.c
index f6526f43aa6e..6355c2ccfb8a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -87,6 +87,7 @@
 #include <linux/unaligned.h>
 #include <linux/capability.h>
+#include <linux/dma-buf.h>
 #include <linux/errno.h>
 #include <linux/errqueue.h>
 #include <linux/types.h>
@@ -151,6 +152,7 @@
 #include <uapi/linux/pidfd.h>
 
 #include "dev.h"
+#include "devmem.h"
 
 static DEFINE_MUTEX(proto_list_mutex);
 static LIST_HEAD(proto_list);
@@ -1081,6 +1083,44 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 #define MAX_DONTNEED_TOKENS 128
 #define MAX_DONTNEED_FRAGS 1024
 
+static noinline_for_stack int
+sock_devmem_dontneed_manual_release(struct sock *sk,
+				    struct dmabuf_token *tokens,
+				    unsigned int num_tokens)
+{
+	struct net_iov *niov;
+	unsigned int i, j;
+	netmem_ref netmem;
+	unsigned int token;
+	int num_frags = 0;
+	int ret = 0;
+
+	if (!sk->sk_devmem_info.binding)
+		return -EINVAL;
+
+	for (i = 0; i < num_tokens; i++) {
+		for (j = 0; j < tokens[i].token_count; j++) {
+			size_t size = sk->sk_devmem_info.binding->dmabuf->size;
+
+			token = tokens[i].token_start + j;
+			if (token >= size / PAGE_SIZE)
+				break;
+
+			if (++num_frags > MAX_DONTNEED_FRAGS)
+				return ret;
+
+			niov = sk->sk_devmem_info.binding->vec[token];
+			if (atomic_dec_and_test(&niov->uref)) {
+				netmem = net_iov_to_netmem(niov);
+				WARN_ON_ONCE(!napi_pp_put_page(netmem));
+			}
+			ret++;
+		}
+	}
+
+	return ret;
+}
+
 static noinline_for_stack int
 sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
 				 unsigned int num_tokens)
@@ -1089,32 +1129,33 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
 	int ret = 0, num_frags = 0;
 	netmem_ref netmems[16];
 
-
xa_lock_bh(&sk->sk_user_frags); + xa_lock_bh(&sk->sk_devmem_info.frags); for (i = 0; i < num_tokens; i++) { for (j = 0; j < tokens[i].token_count; j++) { if (++num_frags > MAX_DONTNEED_FRAGS) goto frag_limit_reached; netmem_ref netmem = (__force netmem_ref)__xa_erase( - &sk->sk_user_frags, tokens[i].token_start + j); + &sk->sk_devmem_info.frags, + tokens[i].token_start + j); if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem))) continue; netmems[netmem_num++] = netmem; if (netmem_num == ARRAY_SIZE(netmems)) { - xa_unlock_bh(&sk->sk_user_frags); + xa_unlock_bh(&sk->sk_devmem_info.frags); for (k = 0; k < netmem_num; k++) WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); netmem_num = 0; - xa_lock_bh(&sk->sk_user_frags); + xa_lock_bh(&sk->sk_devmem_info.frags); } ret++; } } frag_limit_reached: - xa_unlock_bh(&sk->sk_user_frags); + xa_unlock_bh(&sk->sk_devmem_info.frags); for (k = 0; k < netmem_num; k++) WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); @@ -1145,7 +1186,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) return -EFAULT; } - ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens); + if (net_devmem_autorelease_enabled()) + ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens); + else + ret = sock_devmem_dontneed_manual_release(sk, tokens, + num_tokens); kvfree(tokens); return ret; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index d5319ebe2452..a8a4af552909 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -260,6 +260,7 @@ #include <linux/memblock.h> #include <linux/highmem.h> #include <linux/cache.h> +#include <linux/dma-buf.h> #include <linux/err.h> #include <linux/time.h> #include <linux/slab.h> @@ -492,7 +493,8 @@ void tcp_init_sock(struct sock *sk) set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); sk_sockets_allocated_inc(sk); - xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1); + xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1); + sk->sk_devmem_info.binding = NULL; } EXPORT_IPV6_MOD(tcp_init_sock); 
@@ -2424,11 +2426,12 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p) /* Commit part that has been copied to user space. */ for (i = 0; i < p->idx; i++) - __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY, - (__force void *)p->netmems[i], GFP_KERNEL); + __xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i], + XA_ZERO_ENTRY, (__force void *)p->netmems[i], + GFP_KERNEL); /* Rollback what has been pre-allocated and is no longer needed. */ for (; i < p->max; i++) - __xa_erase(&sk->sk_user_frags, p->tokens[i]); + __xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]); p->max = 0; p->idx = 0; @@ -2436,14 +2439,18 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p) static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p) { + /* Skip xarray operations if autorelease is disabled (manual mode) */ + if (!net_devmem_autorelease_enabled()) + return; + if (!p->max) return; - xa_lock_bh(&sk->sk_user_frags); + xa_lock_bh(&sk->sk_devmem_info.frags); tcp_xa_pool_commit_locked(sk, p); - xa_unlock_bh(&sk->sk_user_frags); + xa_unlock_bh(&sk->sk_devmem_info.frags); } static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, @@ -2454,24 +2461,41 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, if (p->idx < p->max) return 0; - xa_lock_bh(&sk->sk_user_frags); + xa_lock_bh(&sk->sk_devmem_info.frags); tcp_xa_pool_commit_locked(sk, p); for (k = 0; k < max_frags; k++) { - err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k], + err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k], XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL); if (err) break; } - xa_unlock_bh(&sk->sk_user_frags); + xa_unlock_bh(&sk->sk_devmem_info.frags); p->max = k; p->idx = 0; return k ? 
0 : err; } +static void tcp_xa_pool_inc_pp_ref_count(struct tcp_xa_pool *tcp_xa_pool, + skb_frag_t *frag) +{ + struct net_iov *niov; + + niov = skb_frag_net_iov(frag); + + if (net_devmem_autorelease_enabled()) { + atomic_long_inc(&niov->desc.pp_ref_count); + tcp_xa_pool->netmems[tcp_xa_pool->idx++] = + skb_frag_netmem(frag); + } else { + if (atomic_inc_return(&niov->uref) == 1) + atomic_long_inc(&niov->desc.pp_ref_count); + } +} + /* On error, returns the -errno. On success, returns number of bytes sent to the * user. May not consume all of @remaining_len. */ @@ -2533,6 +2557,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, * sequence of cmsg */ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + struct net_devmem_dmabuf_binding *binding = NULL; skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; struct net_iov *niov; u64 frag_offset; @@ -2568,13 +2593,35 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, start; dmabuf_cmsg.frag_offset = frag_offset; dmabuf_cmsg.frag_size = copy; - err = tcp_xa_pool_refill(sk, &tcp_xa_pool, - skb_shinfo(skb)->nr_frags - i); - if (err) + + binding = net_devmem_iov_binding(niov); + + if (!sk->sk_devmem_info.binding) { + net_devmem_dmabuf_binding_get(binding); + sk->sk_devmem_info.binding = binding; + } + + if (sk->sk_devmem_info.binding != binding) { + err = -EFAULT; goto out; + } + + if (net_devmem_autorelease_enabled()) { + err = tcp_xa_pool_refill(sk, + &tcp_xa_pool, + skb_shinfo(skb)->nr_frags - i); + if (err) + goto out; + + dmabuf_cmsg.frag_token = + tcp_xa_pool.tokens[tcp_xa_pool.idx]; + } else { + dmabuf_cmsg.frag_token = + net_iov_virtual_addr(niov) >> PAGE_SHIFT; + } + /* Will perform the exchange later */ - dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx]; dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov); offset += copy; @@ -2587,8 +2634,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, if (err) goto out; - 
atomic_long_inc(&niov->desc.pp_ref_count); - tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag); + tcp_xa_pool_inc_pp_ref_count(&tcp_xa_pool, frag); sent += copy; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index f8a9596e8f4d..7b1b5a17002f 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -89,6 +89,9 @@ #include <crypto/md5.h> +#include <linux/dma-buf.h> +#include "../core/devmem.h" + #include <trace/events/tcp.h> #ifdef CONFIG_TCP_MD5SIG @@ -2492,7 +2495,7 @@ static void tcp_release_user_frags(struct sock *sk) unsigned long index; void *netmem; - xa_for_each(&sk->sk_user_frags, index, netmem) + xa_for_each(&sk->sk_devmem_info.frags, index, netmem) WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem)); #endif } @@ -2503,7 +2506,11 @@ void tcp_v4_destroy_sock(struct sock *sk) tcp_release_user_frags(sk); - xa_destroy(&sk->sk_user_frags); + xa_destroy(&sk->sk_devmem_info.frags); + if (sk->sk_devmem_info.binding) { + net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding); + sk->sk_devmem_info.binding = NULL; + } trace_tcp_destroy_sock(sk); diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index bd5462154f97..2aec977f5c12 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -662,7 +662,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS); - xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1); + xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1); + newsk->sk_devmem_info.binding = NULL; return newsk; } diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h index e0b579a1df4f..1e5c209cb998 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -207,6 +207,7 @@ enum { NETDEV_A_DMABUF_QUEUES, NETDEV_A_DMABUF_FD, NETDEV_A_DMABUF_ID, + NETDEV_A_DMABUF_AUTORELEASE, __NETDEV_A_DMABUF_MAX, NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) -- 2.47.3 ^ permalink raw 
reply related [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-10 2:18 ` [PATCH net-next v9 3/5] net: devmem: implement autorelease token management Bobby Eshleman @ 2026-01-11 19:12 ` Mina Almasry 2026-01-11 19:27 ` Mina Almasry 2026-01-12 16:24 ` Bobby Eshleman 2026-01-13 4:00 ` [net-next,v9,3/5] " Jakub Kicinski 1 sibling, 2 replies; 21+ messages in thread From: Mina Almasry @ 2026-01-11 19:12 UTC (permalink / raw) To: Bobby Eshleman Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > Add support for autorelease toggling of tokens using a static branch to > control system-wide behavior. This allows applications to choose between > two memory management modes: > > 1. Autorelease on: Leaked tokens are automatically released when the > socket closes. > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > binding is disallowed and is rejected by netlink. The system will be > "locked" into the mode that the first binding is set to. It can only be > changed again once there are zero bindings on the system. > > Disabling autorelease offers ~13% improvement in CPU utilization. > > Static branching is used to limit the system to one mode or the other. 
> > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > --- > Changes in v9: > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > stubbed out when NET_DEVMEM=n > - only dec rx binding count for rx bindings in free (v8 did not exclude > TX bindings) > > Changes in v8: > - Only reset static key when bindings go to zero, defaulting back to > disabled (Stan). > - Fix bad usage of xarray spinlock for sleepy static branch switching, > use mutex instead. > - Access pp_ref_count via niov->desc instead of niov directly. > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > the static key can not be changed while there are outstanding tokens > (free is only called when reference count reaches zero). > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > even after xa_erase(), so static key changes must wait until all > RX bindings are finally freed (not just when xarray is empty). A > counter is a simple way to track this. > - socket takes reference on the binding, to avoid use-after-free on > sk_devmem_info.binding in the case that user releases all tokens, > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> - removed some comments that were unnecessary > > Changes in v7: > - implement autorelease with static branch (Stan) > - use netlink instead of sockopt (Stan) > - merge uAPI and implementation patches into one patch (seemed less > confusing) > > Changes in v6: > - remove sk_devmem_info.autorelease, using binding->autorelease instead > - move binding->autorelease check to outside of > net_devmem_dmabuf_binding_put_urefs() (Mina) > - remove overly defensive net_is_devmem_iov() (Mina) > - add comment about multiple urefs mapping to a single netmem ref (Mina) > - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina) > - use niov without casting back and forth with netmem (Mina) > - move the autorelease flag from per-binding to per-socket (Mina) > - remove the batching logic in sock_devmem_dontneed_manual_release() > (Mina) > - move autorelease check inside tcp_xa_pool_commit() (Mina) > - remove single-binding restriction for autorelease mode (Mina) > - unbind always checks for leaked urefs > > Changes in v5: > - remove unused variables > - introduce autorelease flag, preparing for future patch toggle new > behavior > > Changes in v3: > - make urefs per-binding instead of per-socket, reducing memory > footprint > - fallback to cleaning up references in dmabuf unbind if socket leaked > tokens > - drop ethtool patch > > Changes in v2: > - always use GFP_ZERO for binding->vec (Mina) > - remove WARN for changed binding (Mina) > - remove extraneous binding ref get (Mina) > - remove WARNs on invalid user input (Mina) > - pre-assign niovs in binding->vec for RX case (Mina) > - use atomic_set(, 0) to initialize sk_user_frags.urefs > - fix length of alloc for urefs > --- > Documentation/netlink/specs/netdev.yaml | 12 ++++ > include/net/netmem.h | 1 + > include/net/sock.h | 7 ++- > include/uapi/linux/netdev.h | 1 + > net/core/devmem.c | 104 ++++++++++++++++++++++++++++---- > net/core/devmem.h | 27 ++++++++- > net/core/netdev-genl-gen.c | 5 +- > 
net/core/netdev-genl.c | 10 ++- > net/core/sock.c | 57 +++++++++++++++-- > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > net/ipv4/tcp_ipv4.c | 11 +++- > net/ipv4/tcp_minisocks.c | 3 +- > tools/include/uapi/linux/netdev.h | 1 + > 13 files changed, 269 insertions(+), 46 deletions(-) > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > index 596c306ce52b..7cbe9e7b9ee5 100644 > --- a/Documentation/netlink/specs/netdev.yaml > +++ b/Documentation/netlink/specs/netdev.yaml > @@ -562,6 +562,17 @@ attribute-sets: > type: u32 > checks: > min: 1 > + - > + name: autorelease > + doc: | > + Token autorelease mode. If true (1), leaked tokens are automatically > + released when the socket closes. If false (0), leaked tokens are only > + released when the dmabuf is unbound. Once a binding is created with a > + specific mode, all subsequent bindings system-wide must use the same > + mode. > + > + Optional. Defaults to false if not specified. > + type: u8 > > operations: > list: > @@ -769,6 +780,7 @@ operations: > - ifindex > - fd > - queues > + - autorelease > reply: > attributes: > - id > diff --git a/include/net/netmem.h b/include/net/netmem.h > index 9e10f4ac50c3..80d2263ba4ed 100644 > --- a/include/net/netmem.h > +++ b/include/net/netmem.h > @@ -112,6 +112,7 @@ struct net_iov { > }; > struct net_iov_area *owner; > enum net_iov_type type; > + atomic_t uref; > }; > > struct net_iov_area { > diff --git a/include/net/sock.h b/include/net/sock.h > index aafe8bdb2c0f..9d3d5bde15e9 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -352,7 +352,7 @@ struct sk_filter; > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > * @sk_scm_unused: unused flags for scm_recv() > * @ns_tracker: tracker for netns reference > - * @sk_user_frags: xarray of pages the user is holding a reference on. 
> + * @sk_devmem_info: the devmem binding information for the socket > * @sk_owner: reference to the real owner of the socket that calls > * sock_lock_init_class_and_name(). > */ > @@ -584,7 +584,10 @@ struct sock { > struct numa_drop_counters *sk_drop_counters; > struct rcu_head sk_rcu; > netns_tracker ns_tracker; > - struct xarray sk_user_frags; > + struct { > + struct xarray frags; > + struct net_devmem_dmabuf_binding *binding; > + } sk_devmem_info; > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > struct module *sk_owner; > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > index e0b579a1df4f..1e5c209cb998 100644 > --- a/include/uapi/linux/netdev.h > +++ b/include/uapi/linux/netdev.h > @@ -207,6 +207,7 @@ enum { > NETDEV_A_DMABUF_QUEUES, > NETDEV_A_DMABUF_FD, > NETDEV_A_DMABUF_ID, > + NETDEV_A_DMABUF_AUTORELEASE, > > __NETDEV_A_DMABUF_MAX, > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > diff --git a/net/core/devmem.c b/net/core/devmem.c > index 05a9a9e7abb9..05c16df657c7 100644 > --- a/net/core/devmem.c > +++ b/net/core/devmem.c > @@ -11,6 +11,7 @@ > #include <linux/genalloc.h> > #include <linux/mm.h> > #include <linux/netdevice.h> > +#include <linux/skbuff_ref.h> > #include <linux/types.h> > #include <net/netdev_queues.h> > #include <net/netdev_rx_queue.h> > @@ -28,6 +29,19 @@ > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > +/* If the user unbinds before releasing all tokens, the static key must not > + * change until all tokens have been released (to avoid calling the wrong > + * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes > + * and binding alloc/free atomic with regards to each other, using the > + * devmem_ar_lock. This works because binding free does not occur until all of > + * the outstanding token's references on the binding are dropped. 
> + */ > +static DEFINE_MUTEX(devmem_ar_lock); > + > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key); > +EXPORT_SYMBOL(tcp_devmem_ar_key); > +static int net_devmem_dmabuf_rx_bindings_count; > + > static const struct memory_provider_ops dmabuf_devmem_ops; > > bool net_is_devmem_iov(struct net_iov *niov) > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) > > size_t size, avail; > > + if (binding->direction == DMA_FROM_DEVICE) { > + mutex_lock(&devmem_ar_lock); > + net_devmem_dmabuf_rx_bindings_count--; > + if (net_devmem_dmabuf_rx_bindings_count == 0) > + static_branch_disable(&tcp_devmem_ar_key); > + mutex_unlock(&devmem_ar_lock); > + } > + I find this logic with devmem_ar_lock and net_devmem_dmabuf_rx_bindings_count a bit complicated. I wonder if we can do another simplification here? Can we have it such that the first binding sets the system in autorelease on or autorelease off mode, and all future bindings maintain this state? We already don't support autorelease on/off mix. > gen_pool_for_each_chunk(binding->chunk_pool, > net_devmem_dmabuf_free_chunk_owner, NULL); > > @@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov) > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); > } > > +static void > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding) > +{ > + int i; > + > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) { > + struct net_iov *niov; > + netmem_ref netmem; > + > + niov = binding->vec[i]; > + netmem = net_iov_to_netmem(niov); > + > + /* Multiple urefs map to only a single netmem ref.
*/ > + if (atomic_xchg(&niov->uref, 0) > 0) > + WARN_ON_ONCE(!napi_pp_put_page(netmem)); > + } > +} > + > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > { > struct netdev_rx_queue *rxq; > @@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); > } > > + net_devmem_dmabuf_binding_put_urefs(binding); Sigh, I think what you're trying to do here is very complicated. You need to think about this scenario: 1. User binds a dmabuf and opens an autorelease=off socket. 2. Data arrives on these sockets and sits in the receive queues; recvmsg has not been called yet by the user. 3. User unbinds the dma-buf; netmems are still in the receive queues. 4. User calls recvmsg on one of these sockets, which obtains a uref on the netmems in the receive queues. 5. User closes the socket. With autorelease=on, this works, because the binding remains alive until step 5 (even though it's unbound from the queue, ..._binding_free has not been called yet) and step 5 cleans up all references, even if the binding is unbound but alive, and calling net_devmem_dmabuf_binding_put_urefs here is weird. Autorelease=off implies the user must clean up their urefs themselves, but we have this here in the unbind path, and it doesn't even guarantee that the urefs are free at this point because it may race with recvmsg. Should we delete this uref cleanup here, and enforce that autorelease=off means that the user cleans up the references (the kernel never cleans them up on unbind or socket close)? The dontneed path needs to work whether the binding is active or unbound.
> net_devmem_dmabuf_binding_put(binding); > } > > @@ -179,8 +220,10 @@ struct net_devmem_dmabuf_binding * > net_devmem_bind_dmabuf(struct net_device *dev, > struct device *dma_dev, > enum dma_data_direction direction, > - unsigned int dmabuf_fd, struct netdev_nl_sock *priv, > - struct netlink_ext_ack *extack) > + unsigned int dmabuf_fd, > + struct netdev_nl_sock *priv, > + struct netlink_ext_ack *extack, > + bool autorelease) > { > struct net_devmem_dmabuf_binding *binding; > static u32 id_alloc_next; > @@ -231,14 +274,12 @@ net_devmem_bind_dmabuf(struct net_device *dev, > goto err_detach; > } > > - if (direction == DMA_TO_DEVICE) { > - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE, > - sizeof(struct net_iov *), > - GFP_KERNEL); > - if (!binding->vec) { > - err = -ENOMEM; > - goto err_unmap; > - } > + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE, > + sizeof(struct net_iov *), > + GFP_KERNEL | __GFP_ZERO); > + if (!binding->vec) { > + err = -ENOMEM; > + goto err_unmap; > } > > /* For simplicity we expect to make PAGE_SIZE allocations, but the > @@ -292,25 +333,62 @@ net_devmem_bind_dmabuf(struct net_device *dev, > niov = &owner->area.niovs[i]; > niov->type = NET_IOV_DMABUF; > niov->owner = &owner->area; > + atomic_set(&niov->uref, 0); > page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov), > net_devmem_get_dma_addr(niov)); > - if (direction == DMA_TO_DEVICE) > - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov; > + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov; > } > > virtual += len; > } > > + mutex_lock(&devmem_ar_lock); > + > + if (direction == DMA_FROM_DEVICE) { > + if (net_devmem_dmabuf_rx_bindings_count > 0) { > + bool mode; > + > + mode = static_key_enabled(&tcp_devmem_ar_key); > + > + /* When bindings exist, enforce that the mode does not > + * change. 
> + */ > + if (mode != autorelease) { > + NL_SET_ERR_MSG_FMT(extack, > + "System already configured with autorelease=%d", > + mode); > + err = -EINVAL; > + goto err_unlock_mutex; > + } > + } else if (autorelease) { > + /* First binding with autorelease enabled sets the > + * mode. If autorelease is false, the key is already > + * disabled by default so no action is needed. > + */ > + static_branch_enable(&tcp_devmem_ar_key); > + } > + > + net_devmem_dmabuf_rx_bindings_count++; > + } > + > err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id, > binding, xa_limit_32b, &id_alloc_next, > GFP_KERNEL); > if (err < 0) > - goto err_free_chunks; > + goto err_dec_binding_count; > + > + mutex_unlock(&devmem_ar_lock); > > list_add(&binding->list, &priv->bindings); > > return binding; > > +err_dec_binding_count: > + if (direction == DMA_FROM_DEVICE) > + net_devmem_dmabuf_rx_bindings_count--; > + > +err_unlock_mutex: > + mutex_unlock(&devmem_ar_lock); > err_free_chunks: > gen_pool_for_each_chunk(binding->chunk_pool, > net_devmem_dmabuf_free_chunk_owner, NULL); > diff --git a/net/core/devmem.h b/net/core/devmem.h > index 1ea6228e4f40..8c586f30e371 100644 > --- a/net/core/devmem.h > +++ b/net/core/devmem.h > @@ -12,9 +12,13 @@ > > #include <net/netmem.h> > #include <net/netdev_netlink.h> > +#include <linux/jump_label.h> > > struct netlink_ext_ack; > > +/* static key for TCP devmem autorelease */ > +extern struct static_key_false tcp_devmem_ar_key; > + > struct net_devmem_dmabuf_binding { > struct dma_buf *dmabuf; > struct dma_buf_attachment *attachment; > @@ -61,7 +65,7 @@ struct net_devmem_dmabuf_binding { > > /* Array of net_iov pointers for this binding, sorted by virtual > * address. This array is convenient to map the virtual addresses to > - * net_iovs in the TX path. > + * net_iovs. 
> */ > struct net_iov **vec; > > @@ -88,7 +92,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, > struct device *dma_dev, > enum dma_data_direction direction, > unsigned int dmabuf_fd, struct netdev_nl_sock *priv, > - struct netlink_ext_ack *extack); > + struct netlink_ext_ack *extack, bool autorelease); > struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id); > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding); > int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, > @@ -138,6 +142,11 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding) > schedule_work(&binding->unbind_w); > } > > +static inline bool net_devmem_autorelease_enabled(void) > +{ > + return static_branch_unlikely(&tcp_devmem_ar_key); > +} > + > void net_devmem_get_net_iov(struct net_iov *niov); > void net_devmem_put_net_iov(struct net_iov *niov); > > @@ -155,6 +164,12 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr, > #else > struct net_devmem_dmabuf_binding; > > +static inline bool > +net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding) > +{ > + return false; > +} > + > static inline void > net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding) > { > @@ -174,7 +189,8 @@ net_devmem_bind_dmabuf(struct net_device *dev, > enum dma_data_direction direction, > unsigned int dmabuf_fd, > struct netdev_nl_sock *priv, > - struct netlink_ext_ack *extack) > + struct netlink_ext_ack *extack, > + bool autorelease) > { > return ERR_PTR(-EOPNOTSUPP); > } > @@ -241,6 +257,11 @@ net_devmem_iov_binding(const struct net_iov *niov) > { > return NULL; > } > + > +static inline bool net_devmem_autorelease_enabled(void) > +{ > + return false; > +} > #endif > > #endif /* _NET_DEVMEM_H */ > diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c > index ba673e81716f..01b7765e11ec 100644 > --- a/net/core/netdev-genl-gen.c > +++ b/net/core/netdev-genl-gen.c > @@ 
-86,10 +86,11 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE > }; > > /* NETDEV_CMD_BIND_RX - do */ > -static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = { > +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_AUTORELEASE + 1] = { > [NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), > [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, }, > [NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy), > + [NETDEV_A_DMABUF_AUTORELEASE] = { .type = NLA_U8, }, > }; > > /* NETDEV_CMD_NAPI_SET - do */ > @@ -188,7 +189,7 @@ static const struct genl_split_ops netdev_nl_ops[] = { > .cmd = NETDEV_CMD_BIND_RX, > .doit = netdev_nl_bind_rx_doit, > .policy = netdev_bind_rx_nl_policy, > - .maxattr = NETDEV_A_DMABUF_FD, > + .maxattr = NETDEV_A_DMABUF_AUTORELEASE, > .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO, > }, > { > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c > index 470fabbeacd9..c742bb34865e 100644 > --- a/net/core/netdev-genl.c > +++ b/net/core/netdev-genl.c > @@ -939,6 +939,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) > struct netdev_nl_sock *priv; > struct net_device *netdev; > unsigned long *rxq_bitmap; > + bool autorelease = false; > struct device *dma_dev; > struct sk_buff *rsp; > int err = 0; > @@ -952,6 +953,10 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) > ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]); > dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]); > > + if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE]) > + autorelease = > + !!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]); > + > priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk); > if (IS_ERR(priv)) > return PTR_ERR(priv); > @@ -1002,7 +1007,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) > } > > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE, > - dmabuf_fd, 
priv, info->extack); > + dmabuf_fd, priv, info->extack, > + autorelease); > if (IS_ERR(binding)) { > err = PTR_ERR(binding); > goto err_rxq_bitmap; > @@ -1097,7 +1103,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info) > > dma_dev = netdev_queue_get_dma_dev(netdev, 0); > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE, > - dmabuf_fd, priv, info->extack); > + dmabuf_fd, priv, info->extack, false); > if (IS_ERR(binding)) { > err = PTR_ERR(binding); > goto err_unlock_netdev; > diff --git a/net/core/sock.c b/net/core/sock.c > index f6526f43aa6e..6355c2ccfb8a 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -87,6 +87,7 @@ > > #include <linux/unaligned.h> > #include <linux/capability.h> > +#include <linux/dma-buf.h> > #include <linux/errno.h> > #include <linux/errqueue.h> > #include <linux/types.h> > @@ -151,6 +152,7 @@ > #include <uapi/linux/pidfd.h> > > #include "dev.h" > +#include "devmem.h" > > static DEFINE_MUTEX(proto_list_mutex); > static LIST_HEAD(proto_list); > @@ -1081,6 +1083,44 @@ static int sock_reserve_memory(struct sock *sk, int bytes) > #define MAX_DONTNEED_TOKENS 128 > #define MAX_DONTNEED_FRAGS 1024 > > +static noinline_for_stack int > +sock_devmem_dontneed_manual_release(struct sock *sk, > + struct dmabuf_token *tokens, > + unsigned int num_tokens) > +{ > + struct net_iov *niov; > + unsigned int i, j; > + netmem_ref netmem; > + unsigned int token; > + int num_frags = 0; > + int ret = 0; > + > + if (!sk->sk_devmem_info.binding) > + return -EINVAL; > + > + for (i = 0; i < num_tokens; i++) { > + for (j = 0; j < tokens[i].token_count; j++) { > + size_t size = sk->sk_devmem_info.binding->dmabuf->size; > + > + token = tokens[i].token_start + j; > + if (token >= size / PAGE_SIZE) > + break; > + > + if (++num_frags > MAX_DONTNEED_FRAGS) > + return ret; > + > + niov = sk->sk_devmem_info.binding->vec[token]; > + if (atomic_dec_and_test(&niov->uref)) { > + netmem = net_iov_to_netmem(niov); > + 
WARN_ON_ONCE(!napi_pp_put_page(netmem)); > + } > + ret++; > + } > + } > + > + return ret; > +} > + > static noinline_for_stack int > sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens, > unsigned int num_tokens) > @@ -1089,32 +1129,33 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens, > int ret = 0, num_frags = 0; > netmem_ref netmems[16]; > > - xa_lock_bh(&sk->sk_user_frags); > + xa_lock_bh(&sk->sk_devmem_info.frags); > for (i = 0; i < num_tokens; i++) { > for (j = 0; j < tokens[i].token_count; j++) { > if (++num_frags > MAX_DONTNEED_FRAGS) > goto frag_limit_reached; > > netmem_ref netmem = (__force netmem_ref)__xa_erase( > - &sk->sk_user_frags, tokens[i].token_start + j); > + &sk->sk_devmem_info.frags, > + tokens[i].token_start + j); > > if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem))) > continue; > > netmems[netmem_num++] = netmem; > if (netmem_num == ARRAY_SIZE(netmems)) { > - xa_unlock_bh(&sk->sk_user_frags); > + xa_unlock_bh(&sk->sk_devmem_info.frags); > for (k = 0; k < netmem_num; k++) > WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); > netmem_num = 0; > - xa_lock_bh(&sk->sk_user_frags); > + xa_lock_bh(&sk->sk_devmem_info.frags); > } > ret++; > } > } > > frag_limit_reached: > - xa_unlock_bh(&sk->sk_user_frags); > + xa_unlock_bh(&sk->sk_devmem_info.frags); > for (k = 0; k < netmem_num; k++) > WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); > > @@ -1145,7 +1186,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) > return -EFAULT; > } > > - ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens); > + if (net_devmem_autorelease_enabled()) > + ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens); > + else > + ret = sock_devmem_dontneed_manual_release(sk, tokens, > + num_tokens); > > kvfree(tokens); > return ret; > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index d5319ebe2452..a8a4af552909 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -260,6 
+260,7 @@ > #include <linux/memblock.h> > #include <linux/highmem.h> > #include <linux/cache.h> > +#include <linux/dma-buf.h> > #include <linux/err.h> > #include <linux/time.h> > #include <linux/slab.h> > @@ -492,7 +493,8 @@ void tcp_init_sock(struct sock *sk) > > set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); > sk_sockets_allocated_inc(sk); > - xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1); > + xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1); > + sk->sk_devmem_info.binding = NULL; > } > EXPORT_IPV6_MOD(tcp_init_sock); > > @@ -2424,11 +2426,12 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p) > > /* Commit part that has been copied to user space. */ > for (i = 0; i < p->idx; i++) > - __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY, > - (__force void *)p->netmems[i], GFP_KERNEL); > + __xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i], > + XA_ZERO_ENTRY, (__force void *)p->netmems[i], > + GFP_KERNEL); > /* Rollback what has been pre-allocated and is no longer needed. 
*/ > for (; i < p->max; i++) > - __xa_erase(&sk->sk_user_frags, p->tokens[i]); > + __xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]); > > p->max = 0; > p->idx = 0; > @@ -2436,14 +2439,18 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p) > > static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p) > { > + /* Skip xarray operations if autorelease is disabled (manual mode) */ > + if (!net_devmem_autorelease_enabled()) > + return; > + > if (!p->max) > return; > > - xa_lock_bh(&sk->sk_user_frags); > + xa_lock_bh(&sk->sk_devmem_info.frags); > > tcp_xa_pool_commit_locked(sk, p); > > - xa_unlock_bh(&sk->sk_user_frags); > + xa_unlock_bh(&sk->sk_devmem_info.frags); > } > > static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, > @@ -2454,24 +2461,41 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, > if (p->idx < p->max) > return 0; > > - xa_lock_bh(&sk->sk_user_frags); > + xa_lock_bh(&sk->sk_devmem_info.frags); > > tcp_xa_pool_commit_locked(sk, p); > > for (k = 0; k < max_frags; k++) { > - err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k], > + err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k], > XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL); > if (err) > break; > } > > - xa_unlock_bh(&sk->sk_user_frags); > + xa_unlock_bh(&sk->sk_devmem_info.frags); > > p->max = k; > p->idx = 0; > return k ? 0 : err; > } > > +static void tcp_xa_pool_inc_pp_ref_count(struct tcp_xa_pool *tcp_xa_pool, > + skb_frag_t *frag) > +{ > + struct net_iov *niov; > + > + niov = skb_frag_net_iov(frag); > + > + if (net_devmem_autorelease_enabled()) { > + atomic_long_inc(&niov->desc.pp_ref_count); > + tcp_xa_pool->netmems[tcp_xa_pool->idx++] = > + skb_frag_netmem(frag); > + } else { > + if (atomic_inc_return(&niov->uref) == 1) > + atomic_long_inc(&niov->desc.pp_ref_count); > + } > +} > + > /* On error, returns the -errno. On success, returns number of bytes sent to the > * user. May not consume all of @remaining_len. 
> */ > @@ -2533,6 +2557,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, > * sequence of cmsg > */ > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { > + struct net_devmem_dmabuf_binding *binding = NULL; > skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; > struct net_iov *niov; > u64 frag_offset; > @@ -2568,13 +2593,35 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, > start; > dmabuf_cmsg.frag_offset = frag_offset; > dmabuf_cmsg.frag_size = copy; > - err = tcp_xa_pool_refill(sk, &tcp_xa_pool, > - skb_shinfo(skb)->nr_frags - i); > - if (err) > + > + binding = net_devmem_iov_binding(niov); > + > + if (!sk->sk_devmem_info.binding) { > + net_devmem_dmabuf_binding_get(binding); > + sk->sk_devmem_info.binding = binding; > + } > + > + if (sk->sk_devmem_info.binding != binding) { > + err = -EFAULT; > goto out; > + } > + > + if (net_devmem_autorelease_enabled()) { > + err = tcp_xa_pool_refill(sk, > + &tcp_xa_pool, > + skb_shinfo(skb)->nr_frags - i); > + if (err) > + goto out; > + > + dmabuf_cmsg.frag_token = > + tcp_xa_pool.tokens[tcp_xa_pool.idx]; > + } else { > + dmabuf_cmsg.frag_token = > + net_iov_virtual_addr(niov) >> PAGE_SHIFT; > + } > + > > /* Will perform the exchange later */ > - dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx]; > dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov); > > offset += copy; > @@ -2587,8 +2634,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, > if (err) > goto out; > > - atomic_long_inc(&niov->desc.pp_ref_count); > - tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag); > + tcp_xa_pool_inc_pp_ref_count(&tcp_xa_pool, frag); > > sent += copy; > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index f8a9596e8f4d..7b1b5a17002f 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -89,6 +89,9 @@ > > #include <crypto/md5.h> > > +#include <linux/dma-buf.h> > +#include "../core/devmem.h" > + > #include 
<trace/events/tcp.h> > > #ifdef CONFIG_TCP_MD5SIG > @@ -2492,7 +2495,7 @@ static void tcp_release_user_frags(struct sock *sk) > unsigned long index; > void *netmem; > > - xa_for_each(&sk->sk_user_frags, index, netmem) > + xa_for_each(&sk->sk_devmem_info.frags, index, netmem) > WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem)); > #endif > } > @@ -2503,7 +2506,11 @@ void tcp_v4_destroy_sock(struct sock *sk) > > tcp_release_user_frags(sk); > > - xa_destroy(&sk->sk_user_frags); > + xa_destroy(&sk->sk_devmem_info.frags); > + if (sk->sk_devmem_info.binding) { > + net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding); I don't understand the refcounting relationship between the binding and the socket when autorelease=off. It seems you're grabbing a reference on the binding on every recvmsg(), but only dropping the reference on the binding once here, so the binding is never freed. > + sk->sk_devmem_info.binding = NULL; > + } > > trace_tcp_destroy_sock(sk); > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > index bd5462154f97..2aec977f5c12 100644 > --- a/net/ipv4/tcp_minisocks.c > +++ b/net/ipv4/tcp_minisocks.c > @@ -662,7 +662,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, > > __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS); > > - xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1); > + xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1); > + newsk->sk_devmem_info.binding = NULL; > > return newsk; > } > diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h > index e0b579a1df4f..1e5c209cb998 100644 > --- a/tools/include/uapi/linux/netdev.h > +++ b/tools/include/uapi/linux/netdev.h > @@ -207,6 +207,7 @@ enum { > NETDEV_A_DMABUF_QUEUES, > NETDEV_A_DMABUF_FD, > NETDEV_A_DMABUF_ID, > + NETDEV_A_DMABUF_AUTORELEASE, > > __NETDEV_A_DMABUF_MAX, > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > -- > 2.47.3 > -- Thanks, Mina ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-11 19:12 ` Mina Almasry @ 2026-01-11 19:27 ` Mina Almasry 2026-01-12 17:42 ` Bobby Eshleman 2026-01-12 16:24 ` Bobby Eshleman 1 sibling, 1 reply; 21+ messages in thread From: Mina Almasry @ 2026-01-11 19:27 UTC (permalink / raw) To: Bobby Eshleman Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Sun, Jan 11, 2026 at 11:12 AM Mina Almasry <almasrymina@google.com> wrote: > > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > Add support for autorelease toggling of tokens using a static branch to > > control system-wide behavior. This allows applications to choose between > > two memory management modes: > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > socket closes. > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > binding is disallowed and is rejected by netlink. The system will be > > "locked" into the mode that the first binding is set to. It can only be > > changed again once there are zero bindings on the system. > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > Static branching is used to limit the system to one mode or the other. 
> > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > --- > > Changes in v9: > > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > > stubbed out when NET_DEVMEM=n > > - only dec rx binding count for rx bindings in free (v8 did not exclude > > TX bindings) > > > > Changes in v8: > > - Only reset static key when bindings go to zero, defaulting back to > > disabled (Stan). > > - Fix bad usage of xarray spinlock for sleepy static branch switching, > > use mutex instead. > > - Access pp_ref_count via niov->desc instead of niov directly. > > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > > the static key can not be changed while there are outstanding tokens > > (free is only called when reference count reaches zero). > > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > > even after xa_erase(), so static key changes must wait until all > > RX bindings are finally freed (not just when xarray is empty). A > > counter is a simple way to track this. > > - socket takes reference on the binding, to avoid use-after-free on > > sk_devmem_info.binding in the case that user releases all tokens, > > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> > - removed some comments that were unnecessary > > > > Changes in v7: > > - implement autorelease with static branch (Stan) > > - use netlink instead of sockopt (Stan) > > - merge uAPI and implementation patches into one patch (seemed less > > confusing) > > > > Changes in v6: > > - remove sk_devmem_info.autorelease, using binding->autorelease instead > > - move binding->autorelease check to outside of > > net_devmem_dmabuf_binding_put_urefs() (Mina) > > - remove overly defensive net_is_devmem_iov() (Mina) > > - add comment about multiple urefs mapping to a single netmem ref (Mina) > > - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina) > > - use niov without casting back and forth with netmem (Mina) > > - move the autorelease flag from per-binding to per-socket (Mina) > > - remove the batching logic in sock_devmem_dontneed_manual_release() > > (Mina) > > - move autorelease check inside tcp_xa_pool_commit() (Mina) > > - remove single-binding restriction for autorelease mode (Mina) > > - unbind always checks for leaked urefs > > > > Changes in v5: > > - remove unused variables > > - introduce autorelease flag, preparing for future patch toggle new > > behavior > > > > Changes in v3: > > - make urefs per-binding instead of per-socket, reducing memory > > footprint > > - fallback to cleaning up references in dmabuf unbind if socket leaked > > tokens > > - drop ethtool patch > > > > Changes in v2: > > - always use GFP_ZERO for binding->vec (Mina) > > - remove WARN for changed binding (Mina) > > - remove extraneous binding ref get (Mina) > > - remove WARNs on invalid user input (Mina) > > - pre-assign niovs in binding->vec for RX case (Mina) > > - use atomic_set(, 0) to initialize sk_user_frags.urefs > > - fix length of alloc for urefs > > --- > > Documentation/netlink/specs/netdev.yaml | 12 ++++ > > include/net/netmem.h | 1 + > > include/net/sock.h | 7 ++- > > include/uapi/linux/netdev.h | 1 + > > net/core/devmem.c | 104 
++++++++++++++++++++++++++++---- > > net/core/devmem.h | 27 ++++++++- > > net/core/netdev-genl-gen.c | 5 +- > > net/core/netdev-genl.c | 10 ++- > > net/core/sock.c | 57 +++++++++++++++-- > > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > > net/ipv4/tcp_ipv4.c | 11 +++- > > net/ipv4/tcp_minisocks.c | 3 +- > > tools/include/uapi/linux/netdev.h | 1 + > > 13 files changed, 269 insertions(+), 46 deletions(-) > > > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > > index 596c306ce52b..7cbe9e7b9ee5 100644 > > --- a/Documentation/netlink/specs/netdev.yaml > > +++ b/Documentation/netlink/specs/netdev.yaml > > @@ -562,6 +562,17 @@ attribute-sets: > > type: u32 > > checks: > > min: 1 > > + - > > + name: autorelease > > + doc: | > > + Token autorelease mode. If true (1), leaked tokens are automatically > > + released when the socket closes. If false (0), leaked tokens are only > > + released when the dmabuf is unbound. Once a binding is created with a > > + specific mode, all subsequent bindings system-wide must use the same > > + mode. > > + > > + Optional. Defaults to false if not specified. 
> > + type: u8 > > > > operations: > > list: > > @@ -769,6 +780,7 @@ operations: > > - ifindex > > - fd > > - queues > > + - autorelease > > reply: > > attributes: > > - id > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > index 9e10f4ac50c3..80d2263ba4ed 100644 > > --- a/include/net/netmem.h > > +++ b/include/net/netmem.h > > @@ -112,6 +112,7 @@ struct net_iov { > > }; > > struct net_iov_area *owner; > > enum net_iov_type type; > > + atomic_t uref; > > }; > > > > struct net_iov_area { > > diff --git a/include/net/sock.h b/include/net/sock.h > > index aafe8bdb2c0f..9d3d5bde15e9 100644 > > --- a/include/net/sock.h > > +++ b/include/net/sock.h > > @@ -352,7 +352,7 @@ struct sk_filter; > > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > > * @sk_scm_unused: unused flags for scm_recv() > > * @ns_tracker: tracker for netns reference > > - * @sk_user_frags: xarray of pages the user is holding a reference on. > > + * @sk_devmem_info: the devmem binding information for the socket > > * @sk_owner: reference to the real owner of the socket that calls > > * sock_lock_init_class_and_name(). 
> > */ > > @@ -584,7 +584,10 @@ struct sock { > > struct numa_drop_counters *sk_drop_counters; > > struct rcu_head sk_rcu; > > netns_tracker ns_tracker; > > - struct xarray sk_user_frags; > > + struct { > > + struct xarray frags; > > + struct net_devmem_dmabuf_binding *binding; > > + } sk_devmem_info; > > > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > > struct module *sk_owner; > > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > > index e0b579a1df4f..1e5c209cb998 100644 > > --- a/include/uapi/linux/netdev.h > > +++ b/include/uapi/linux/netdev.h > > @@ -207,6 +207,7 @@ enum { > > NETDEV_A_DMABUF_QUEUES, > > NETDEV_A_DMABUF_FD, > > NETDEV_A_DMABUF_ID, > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > __NETDEV_A_DMABUF_MAX, > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > index 05a9a9e7abb9..05c16df657c7 100644 > > --- a/net/core/devmem.c > > +++ b/net/core/devmem.c > > @@ -11,6 +11,7 @@ > > #include <linux/genalloc.h> > > #include <linux/mm.h> > > #include <linux/netdevice.h> > > +#include <linux/skbuff_ref.h> > > #include <linux/types.h> > > #include <net/netdev_queues.h> > > #include <net/netdev_rx_queue.h> > > @@ -28,6 +29,19 @@ > > > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > > > +/* If the user unbinds before releasing all tokens, the static key must not > > + * change until all tokens have been released (to avoid calling the wrong > > + * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes > > + * and binding alloc/free atomic with regards to each other, using the > > + * devmem_ar_lock. This works because binding free does not occur until all of > > + * the outstanding token's references on the binding are dropped. 
> > + */ > > +static DEFINE_MUTEX(devmem_ar_lock); > > + > > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key); > > +EXPORT_SYMBOL(tcp_devmem_ar_key); > > +static int net_devmem_dmabuf_rx_bindings_count; > > + > > static const struct memory_provider_ops dmabuf_devmem_ops; > > > > bool net_is_devmem_iov(struct net_iov *niov) > > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) > > > > size_t size, avail; > > > > + if (binding->direction == DMA_FROM_DEVICE) { > > + mutex_lock(&devmem_ar_lock); > > + net_devmem_dmabuf_rx_bindings_count--; > > + if (net_devmem_dmabuf_rx_bindings_count == 0) > > + static_branch_disable(&tcp_devmem_ar_key); > > + mutex_unlock(&devmem_ar_lock); > > + } > > + > > I find this logic with devmem_ar_lock and > net_devmem_dmabuf_rx_bindings_count a bit complicated. I wonder if we > can do another simplification here? Can we have it such that the first > binding sets the system in autorelease on or autorelease off mode, and > all future bindings maintain this state? We already don't support > autorelease on/off mix. > > > > gen_pool_for_each_chunk(binding->chunk_pool, > > net_devmem_dmabuf_free_chunk_owner, NULL); > > > > @@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov) > > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); > > } > > > > +static void > > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding) > > +{ > > + int i; > > + > > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) { > > + struct net_iov *niov; > > + netmem_ref netmem; > > + > > + niov = binding->vec[i]; > > + netmem = net_iov_to_netmem(niov); > > + > > + /* Multiple urefs map to only a single netmem ref.
*/ > > + if (atomic_xchg(&niov->uref, 0) > 0) > > + WARN_ON_ONCE(!napi_pp_put_page(netmem)); > > + } > > +} > > + > > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > { > > struct netdev_rx_queue *rxq; > > @@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); > > } > > > > + net_devmem_dmabuf_binding_put_urefs(binding); > > Sigh, I think what you're trying to do here is very complicated. You > need to think about this scenario: > > 1. user binds dmabuf and opens a autorelease=off socket. > 2. Data arrives on these sockets, and sits in the receive queues, > recvmsg has not been called yet by the user. > 3. User unbinds the dma-buff, netmems are still in the receive queues. > 4. User calls recvmsg on one of these sockets, which obtains a uref on > the netmems in the receive queues. > 5. user closes the socket. > > With autorelease=on, this works, because the binding remains alive > until step 5 (even though it's unbound from the queue, > ..._binding_free has not been called yet) and step 5 cleans up all > references, even if the binding is unbound but alive, and > > calling net_devmem_dmabuf_binding_put_urefs here is weird. > Autorelease=off implies the user must clean their urefs themselves, > but we have this here in the unbind path, and it doesn't even > guarantee that the urefs are free at this point because it may race > with a recvmsg. > > Should we delete this uref cleanup here, and enforce that > autorelease=off means that the user cleans up the references (the > kernel never cleans them up on unbind or socket close)? The dontneed > path needs to work whether the binding is active or unbound. > I also now think this scenario could happen, something like: 1. User binds with autorelease=off and open socket. 2. Data arrives on the socket. 3. User unbinds the dmabuf 4. 
User receives data > > Then the user never dontneeds the data received in step 4. AFAICT this > means that the binding will never be freed. This is kinda why my > feeling was that the autorelease property should be a property of the > sockets, and each autorelease=off socket should hold a reference on > the binding to ensure it doesn't go away while the socket is open, > then when all the sockets are closed, they drop the references to the > binding, and the binding is freed and the dmabuf is unmapped > regardless of whether the memory has been dontneed'd or not, because > the sockets are closed. -- Thanks, Mina
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-11 19:27 ` Mina Almasry @ 2026-01-12 17:42 ` Bobby Eshleman 0 siblings, 0 replies; 21+ messages in thread From: Bobby Eshleman @ 2026-01-12 17:42 UTC (permalink / raw) To: Mina Almasry Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Sun, Jan 11, 2026 at 11:27:14AM -0800, Mina Almasry wrote: > On Sun, Jan 11, 2026 at 11:12 AM Mina Almasry <almasrymina@google.com> wrote: > > > > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > Add support for autorelease toggling of tokens using a static branch to > > > control system-wide behavior. This allows applications to choose between > > > two memory management modes: > > > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > > socket closes. > > > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > > binding is disallowed and is rejected by netlink. The system will be > > > "locked" into the mode that the first binding is set to. It can only be > > > changed again once there are zero bindings on the system. > > > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > > > Static branching is used to limit the system to one mode or the other. 
> > > > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > > --- > > > Changes in v9: > > > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > > > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > > > stubbed out when NET_DEVMEM=n > > > - only dec rx binding count for rx bindings in free (v8 did not exclude > > > TX bindings) > > > > > > Changes in v8: > > > - Only reset static key when bindings go to zero, defaulting back to > > > disabled (Stan). > > > - Fix bad usage of xarray spinlock for sleepy static branch switching, > > > use mutex instead. > > > - Access pp_ref_count via niov->desc instead of niov directly. > > > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > > > the static key can not be changed while there are outstanding tokens > > > (free is only called when reference count reaches zero). > > > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > > > even after xa_erase(), so static key changes must wait until all > > > RX bindings are finally freed (not just when xarray is empty). A > > > counter is a simple way to track this. > > > - socket takes reference on the binding, to avoid use-after-free on > > > sk_devmem_info.binding in the case that user releases all tokens, > > > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> > > - removed some comments that were unnecessary > > > > > > Changes in v7: > > > - implement autorelease with static branch (Stan) > > > - use netlink instead of sockopt (Stan) > > > - merge uAPI and implementation patches into one patch (seemed less > > > confusing) > > > > > > Changes in v6: > > > - remove sk_devmem_info.autorelease, using binding->autorelease instead > > > - move binding->autorelease check to outside of > > > net_devmem_dmabuf_binding_put_urefs() (Mina) > > > - remove overly defensive net_is_devmem_iov() (Mina) > > > - add comment about multiple urefs mapping to a single netmem ref (Mina) > > > - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina) > > > - use niov without casting back and forth with netmem (Mina) > > > - move the autorelease flag from per-binding to per-socket (Mina) > > > - remove the batching logic in sock_devmem_dontneed_manual_release() > > > (Mina) > > > - move autorelease check inside tcp_xa_pool_commit() (Mina) > > > - remove single-binding restriction for autorelease mode (Mina) > > > - unbind always checks for leaked urefs > > > > > > Changes in v5: > > > - remove unused variables > > > - introduce autorelease flag, preparing for future patch toggle new > > > behavior > > > > > > Changes in v3: > > > - make urefs per-binding instead of per-socket, reducing memory > > > footprint > > > - fallback to cleaning up references in dmabuf unbind if socket leaked > > > tokens > > > - drop ethtool patch > > > > > > Changes in v2: > > > - always use GFP_ZERO for binding->vec (Mina) > > > - remove WARN for changed binding (Mina) > > > - remove extraneous binding ref get (Mina) > > > - remove WARNs on invalid user input (Mina) > > > - pre-assign niovs in binding->vec for RX case (Mina) > > > - use atomic_set(, 0) to initialize sk_user_frags.urefs > > > - fix length of alloc for urefs > > > --- > > > Documentation/netlink/specs/netdev.yaml | 12 ++++ > > > include/net/netmem.h | 1 + > > > include/net/sock.h | 7 
++- > > > include/uapi/linux/netdev.h | 1 + > > > net/core/devmem.c | 104 ++++++++++++++++++++++++++++---- > > > net/core/devmem.h | 27 ++++++++- > > > net/core/netdev-genl-gen.c | 5 +- > > > net/core/netdev-genl.c | 10 ++- > > > net/core/sock.c | 57 +++++++++++++++-- > > > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > > > net/ipv4/tcp_ipv4.c | 11 +++- > > > net/ipv4/tcp_minisocks.c | 3 +- > > > tools/include/uapi/linux/netdev.h | 1 + > > > 13 files changed, 269 insertions(+), 46 deletions(-) > > > > > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > > > index 596c306ce52b..7cbe9e7b9ee5 100644 > > > --- a/Documentation/netlink/specs/netdev.yaml > > > +++ b/Documentation/netlink/specs/netdev.yaml > > > @@ -562,6 +562,17 @@ attribute-sets: > > > type: u32 > > > checks: > > > min: 1 > > > + - > > > + name: autorelease > > > + doc: | > > > + Token autorelease mode. If true (1), leaked tokens are automatically > > > + released when the socket closes. If false (0), leaked tokens are only > > > + released when the dmabuf is unbound. Once a binding is created with a > > > + specific mode, all subsequent bindings system-wide must use the same > > > + mode. > > > + > > > + Optional. Defaults to false if not specified. 
> > > + type: u8 > > > > > > operations: > > > list: > > > @@ -769,6 +780,7 @@ operations: > > > - ifindex > > > - fd > > > - queues > > > + - autorelease > > > reply: > > > attributes: > > > - id > > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > > index 9e10f4ac50c3..80d2263ba4ed 100644 > > > --- a/include/net/netmem.h > > > +++ b/include/net/netmem.h > > > @@ -112,6 +112,7 @@ struct net_iov { > > > }; > > > struct net_iov_area *owner; > > > enum net_iov_type type; > > > + atomic_t uref; > > > }; > > > > > > struct net_iov_area { > > > diff --git a/include/net/sock.h b/include/net/sock.h > > > index aafe8bdb2c0f..9d3d5bde15e9 100644 > > > --- a/include/net/sock.h > > > +++ b/include/net/sock.h > > > @@ -352,7 +352,7 @@ struct sk_filter; > > > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > > > * @sk_scm_unused: unused flags for scm_recv() > > > * @ns_tracker: tracker for netns reference > > > - * @sk_user_frags: xarray of pages the user is holding a reference on. > > > + * @sk_devmem_info: the devmem binding information for the socket > > > * @sk_owner: reference to the real owner of the socket that calls > > > * sock_lock_init_class_and_name(). 
> > > */ > > > @@ -584,7 +584,10 @@ struct sock { > > > struct numa_drop_counters *sk_drop_counters; > > > struct rcu_head sk_rcu; > > > netns_tracker ns_tracker; > > > - struct xarray sk_user_frags; > > > + struct { > > > + struct xarray frags; > > > + struct net_devmem_dmabuf_binding *binding; > > > + } sk_devmem_info; > > > > > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > > > struct module *sk_owner; > > > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > > > index e0b579a1df4f..1e5c209cb998 100644 > > > --- a/include/uapi/linux/netdev.h > > > +++ b/include/uapi/linux/netdev.h > > > @@ -207,6 +207,7 @@ enum { > > > NETDEV_A_DMABUF_QUEUES, > > > NETDEV_A_DMABUF_FD, > > > NETDEV_A_DMABUF_ID, > > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > > > __NETDEV_A_DMABUF_MAX, > > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > > index 05a9a9e7abb9..05c16df657c7 100644 > > > --- a/net/core/devmem.c > > > +++ b/net/core/devmem.c > > > @@ -11,6 +11,7 @@ > > > #include <linux/genalloc.h> > > > #include <linux/mm.h> > > > #include <linux/netdevice.h> > > > +#include <linux/skbuff_ref.h> > > > #include <linux/types.h> > > > #include <net/netdev_queues.h> > > > #include <net/netdev_rx_queue.h> > > > @@ -28,6 +29,19 @@ > > > > > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > > > > > +/* If the user unbinds before releasing all tokens, the static key must not > > > + * change until all tokens have been released (to avoid calling the wrong > > > + * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes > > > + * and binding alloc/free atomic with regards to each other, using the > > > + * devmem_ar_lock. This works because binding free does not occur until all of > > > + * the outstanding token's references on the binding are dropped. 
> > > + */ > > > +static DEFINE_MUTEX(devmem_ar_lock); > > > + > > > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key); > > > +EXPORT_SYMBOL(tcp_devmem_ar_key); > > > +static int net_devmem_dmabuf_rx_bindings_count; > > > + > > > static const struct memory_provider_ops dmabuf_devmem_ops; > > > > > > bool net_is_devmem_iov(struct net_iov *niov) > > > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) > > > > > > size_t size, avail; > > > > > > + if (binding->direction == DMA_FROM_DEVICE) { > > > + mutex_lock(&devmem_ar_lock); > > > + net_devmem_dmabuf_rx_bindings_count--; > > > + if (net_devmem_dmabuf_rx_bindings_count == 0) > > > + static_branch_disable(&tcp_devmem_ar_key); > > > + mutex_unlock(&devmem_ar_lock); > > > + } > > > + > > > > I find this loging with devmem_ar_lock and > > net_devmem_dmabuf_rx_bindigs_count a bit complicated. I wonder if we > > can do another simplification here? Can we have it such that the first > > binding sets the system in autorelease on or autorelease off mode, and > > all future bindings maintain this state? We already don't support > > autorelease on/off mix. > > > > > > > gen_pool_for_each_chunk(binding->chunk_pool, > > > net_devmem_dmabuf_free_chunk_owner, NULL); > > > > > > @@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov) > > > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); > > > } > > > > > > +static void > > > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding) > > > +{ > > > + int i; > > > + > > > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) { > > > + struct net_iov *niov; > > > + netmem_ref netmem; > > > + > > > + niov = binding->vec[i]; > > > + netmem = net_iov_to_netmem(niov); > > > + > > > + /* Multiple urefs map to only a single netmem ref. 
*/ > > > + if (atomic_xchg(&niov->uref, 0) > 0) > > > + WARN_ON_ONCE(!napi_pp_put_page(netmem)); > > > + } > > > +} > > > + > > > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > { > > > struct netdev_rx_queue *rxq; > > > @@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); > > > } > > > > > > + net_devmem_dmabuf_binding_put_urefs(binding); > > > > Sigh, I think what you're trying to do here is very complicated. You > > need to think about this scenario: > > > > 1. user binds dmabuf and opens a autorelease=off socket. > > 2. Data arrives on these sockets, and sits in the receive queues, > > recvmsg has not been called yet by the user. > > 3. User unbinds the dma-buff, netmems are still in the receive queues. > > 4. User calls recvmsg on one of these sockets, which obtains a uref on > > the netmems in the receive queues. > > 5. user closes the socket. > > > > With autorelease=on, this works, because the binding remains alive > > until step 5 (even though it's unbound from the queue, > > ..._binding_free has not been called yet) and step 5 cleans up all > > references, even if the binding is unbound but alive, and > > > > calling net_devmem_dmabuf_binding_put_urefs here is weird. > > Autorelease=off implies the user must clean their urefs themselves, > > but we have this here in the unbind path, and it doesn't even > > guarantee that the urefs are free at this point because it may race > > with a recvmsg. > > > > Should we delete this uref cleanup here, and enforce that > > autorelease=off means that the user cleans up the references (the > > kernel never cleans them up on unbind or socket close)? The dontneed > > path needs to work whether the binding is active or unbound. > > > > I also now think this scenario could happen, something like: > > 1. User binds with autorelease=off and open socket. > 2. Data arrives on the socket. > 3. 
User unbinds the dmabuf > 4. User calls recvmsg() and receives data > > Then the user never dontneeds the data received in step 4. AFAICT this > means that the binding will never be freed. This is kinda why my > feeling was that the autorelease property should be a property of the > sockets, and each autorelease=off socket should hold a reference on > the binding to ensure it doesn't go away while the socket is open, > then when all the sockets are closed, they drop the references to the > binding, and the binding is freed and the dmabuf is unmapped > regardless of whether the memory has been dontneeded or not, because > the sockets are closed. Ah damn, yeah that is a binding leak. I think the core issue for autorelease=off is that recvmsg() takes a binding reference for every token, but there is no mechanism to release those when the socket closes. Instead, it should take only one reference on the binding (the first time it sees it in recvmsg), and then if binding != NULL on close it should release it. AFAICT, I think this change would need to apply to both the socket-based and the dmabuf-based implementations? Best, Bobby ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-11 19:12 ` Mina Almasry 2026-01-11 19:27 ` Mina Almasry @ 2026-01-12 16:24 ` Bobby Eshleman 2026-01-13 19:27 ` Mina Almasry 2026-01-14 20:54 ` Stanislav Fomichev 1 sibling, 2 replies; 21+ messages in thread From: Bobby Eshleman @ 2026-01-12 16:24 UTC (permalink / raw) To: Mina Almasry Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Sun, Jan 11, 2026 at 11:12:19AM -0800, Mina Almasry wrote: > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > Add support for autorelease toggling of tokens using a static branch to > > control system-wide behavior. This allows applications to choose between > > two memory management modes: > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > socket closes. > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > binding is disallowed and is rejected by netlink. The system will be > > "locked" into the mode that the first binding is set to. It can only be > > changed again once there are zero bindings on the system. > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > Static branching is used to limit the system to one mode or the other. 
> > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > --- > > Changes in v9: > > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > > stubbed out when NET_DEVMEM=n > > - only dec rx binding count for rx bindings in free (v8 did not exclude > > TX bindings) > > > > Changes in v8: > > - Only reset static key when bindings go to zero, defaulting back to > > disabled (Stan). > > - Fix bad usage of xarray spinlock for sleepy static branch switching, > > use mutex instead. > > - Access pp_ref_count via niov->desc instead of niov directly. > > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > > the static key can not be changed while there are outstanding tokens > > (free is only called when reference count reaches zero). > > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > > even after xa_erase(), so static key changes must wait until all > > RX bindings are finally freed (not just when xarray is empty). A > > counter is a simple way to track this. > > - socket takes reference on the binding, to avoid use-after-free on > > sk_devmem_info.binding in the case that user releases all tokens, > > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> > - removed some comments that were unnecessary > > > > Changes in v7: > > - implement autorelease with static branch (Stan) > > - use netlink instead of sockopt (Stan) > > - merge uAPI and implementation patches into one patch (seemed less > > confusing) > > > > Changes in v6: > > - remove sk_devmem_info.autorelease, using binding->autorelease instead > > - move binding->autorelease check to outside of > > net_devmem_dmabuf_binding_put_urefs() (Mina) > > - remove overly defensive net_is_devmem_iov() (Mina) > > - add comment about multiple urefs mapping to a single netmem ref (Mina) > > - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina) > > - use niov without casting back and forth with netmem (Mina) > > - move the autorelease flag from per-binding to per-socket (Mina) > > - remove the batching logic in sock_devmem_dontneed_manual_release() > > (Mina) > > - move autorelease check inside tcp_xa_pool_commit() (Mina) > > - remove single-binding restriction for autorelease mode (Mina) > > - unbind always checks for leaked urefs > > > > Changes in v5: > > - remove unused variables > > - introduce autorelease flag, preparing for future patch toggle new > > behavior > > > > Changes in v3: > > - make urefs per-binding instead of per-socket, reducing memory > > footprint > > - fallback to cleaning up references in dmabuf unbind if socket leaked > > tokens > > - drop ethtool patch > > > > Changes in v2: > > - always use GFP_ZERO for binding->vec (Mina) > > - remove WARN for changed binding (Mina) > > - remove extraneous binding ref get (Mina) > > - remove WARNs on invalid user input (Mina) > > - pre-assign niovs in binding->vec for RX case (Mina) > > - use atomic_set(, 0) to initialize sk_user_frags.urefs > > - fix length of alloc for urefs > > --- > > Documentation/netlink/specs/netdev.yaml | 12 ++++ > > include/net/netmem.h | 1 + > > include/net/sock.h | 7 ++- > > include/uapi/linux/netdev.h | 1 + > > net/core/devmem.c | 104 
++++++++++++++++++++++++++++---- > > net/core/devmem.h | 27 ++++++++- > > net/core/netdev-genl-gen.c | 5 +- > > net/core/netdev-genl.c | 10 ++- > > net/core/sock.c | 57 +++++++++++++++-- > > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > > net/ipv4/tcp_ipv4.c | 11 +++- > > net/ipv4/tcp_minisocks.c | 3 +- > > tools/include/uapi/linux/netdev.h | 1 + > > 13 files changed, 269 insertions(+), 46 deletions(-) > > > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > > index 596c306ce52b..7cbe9e7b9ee5 100644 > > --- a/Documentation/netlink/specs/netdev.yaml > > +++ b/Documentation/netlink/specs/netdev.yaml > > @@ -562,6 +562,17 @@ attribute-sets: > > type: u32 > > checks: > > min: 1 > > + - > > + name: autorelease > > + doc: | > > + Token autorelease mode. If true (1), leaked tokens are automatically > > + released when the socket closes. If false (0), leaked tokens are only > > + released when the dmabuf is unbound. Once a binding is created with a > > + specific mode, all subsequent bindings system-wide must use the same > > + mode. > > + > > + Optional. Defaults to false if not specified. 
> > + type: u8 > > > > operations: > > list: > > @@ -769,6 +780,7 @@ operations: > > - ifindex > > - fd > > - queues > > + - autorelease > > reply: > > attributes: > > - id > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > index 9e10f4ac50c3..80d2263ba4ed 100644 > > --- a/include/net/netmem.h > > +++ b/include/net/netmem.h > > @@ -112,6 +112,7 @@ struct net_iov { > > }; > > struct net_iov_area *owner; > > enum net_iov_type type; > > + atomic_t uref; > > }; > > > > struct net_iov_area { > > diff --git a/include/net/sock.h b/include/net/sock.h > > index aafe8bdb2c0f..9d3d5bde15e9 100644 > > --- a/include/net/sock.h > > +++ b/include/net/sock.h > > @@ -352,7 +352,7 @@ struct sk_filter; > > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > > * @sk_scm_unused: unused flags for scm_recv() > > * @ns_tracker: tracker for netns reference > > - * @sk_user_frags: xarray of pages the user is holding a reference on. > > + * @sk_devmem_info: the devmem binding information for the socket > > * @sk_owner: reference to the real owner of the socket that calls > > * sock_lock_init_class_and_name(). 
> > */ > > @@ -584,7 +584,10 @@ struct sock { > > struct numa_drop_counters *sk_drop_counters; > > struct rcu_head sk_rcu; > > netns_tracker ns_tracker; > > - struct xarray sk_user_frags; > > + struct { > > + struct xarray frags; > > + struct net_devmem_dmabuf_binding *binding; > > + } sk_devmem_info; > > > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > > struct module *sk_owner; > > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > > index e0b579a1df4f..1e5c209cb998 100644 > > --- a/include/uapi/linux/netdev.h > > +++ b/include/uapi/linux/netdev.h > > @@ -207,6 +207,7 @@ enum { > > NETDEV_A_DMABUF_QUEUES, > > NETDEV_A_DMABUF_FD, > > NETDEV_A_DMABUF_ID, > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > __NETDEV_A_DMABUF_MAX, > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > index 05a9a9e7abb9..05c16df657c7 100644 > > --- a/net/core/devmem.c > > +++ b/net/core/devmem.c > > @@ -11,6 +11,7 @@ > > #include <linux/genalloc.h> > > #include <linux/mm.h> > > #include <linux/netdevice.h> > > +#include <linux/skbuff_ref.h> > > #include <linux/types.h> > > #include <net/netdev_queues.h> > > #include <net/netdev_rx_queue.h> > > @@ -28,6 +29,19 @@ > > > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > > > +/* If the user unbinds before releasing all tokens, the static key must not > > + * change until all tokens have been released (to avoid calling the wrong > > + * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes > > + * and binding alloc/free atomic with regards to each other, using the > > + * devmem_ar_lock. This works because binding free does not occur until all of > > + * the outstanding token's references on the binding are dropped. 
> > + */ > > +static DEFINE_MUTEX(devmem_ar_lock); > > + > > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key); > > +EXPORT_SYMBOL(tcp_devmem_ar_key); > > +static int net_devmem_dmabuf_rx_bindings_count; > > + > > static const struct memory_provider_ops dmabuf_devmem_ops; > > > > bool net_is_devmem_iov(struct net_iov *niov) > > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) > > > > size_t size, avail; > > > > + if (binding->direction == DMA_FROM_DEVICE) { > > + mutex_lock(&devmem_ar_lock); > > + net_devmem_dmabuf_rx_bindings_count--; > > + if (net_devmem_dmabuf_rx_bindings_count == 0) > > + static_branch_disable(&tcp_devmem_ar_key); > > + mutex_unlock(&devmem_ar_lock); > > + } > > + > > I find this logic with devmem_ar_lock and net_devmem_dmabuf_rx_bindings_count a bit complicated. I wonder if we can do another simplification here? Can we have it such that the first binding sets the system in autorelease on or autorelease off mode, and all future bindings maintain this state? We already don't support autorelease on/off mix. I think that would greatly simplify things. We would still need a lock to keep the static branch change and the initial mode setting atomic WRT each other, but the other parts (like the one above) can be removed. > > > > gen_pool_for_each_chunk(binding->chunk_pool, > > net_devmem_dmabuf_free_chunk_owner, NULL); > > > > @@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov) > > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); > > } > > > > +static void > > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding) > > +{ > > + int i; > > + > > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) { > > + struct net_iov *niov; > > + netmem_ref netmem; > > + > > + niov = binding->vec[i]; > > + netmem = net_iov_to_netmem(niov); > > + > > + /* Multiple urefs map to only a single netmem ref. 
*/ > > + if (atomic_xchg(&niov->uref, 0) > 0) > > + WARN_ON_ONCE(!napi_pp_put_page(netmem)); > > + } > > +} > > + > > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > { > > struct netdev_rx_queue *rxq; > > @@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); > > } > > > > + net_devmem_dmabuf_binding_put_urefs(binding); > > Sigh, I think what you're trying to do here is very complicated. You > need to think about this scenario: > > 1. user binds dmabuf and opens a autorelease=off socket. > 2. Data arrives on these sockets, and sits in the receive queues, > recvmsg has not been called yet by the user. > 3. User unbinds the dma-buff, netmems are still in the receive queues. > 4. User calls recvmsg on one of these sockets, which obtains a uref on > the netmems in the receive queues. > 5. user closes the socket. > > With autorelease=on, this works, because the binding remains alive > until step 5 (even though it's unbound from the queue, > ..._binding_free has not been called yet) and step 5 cleans up all > references, even if the binding is unbound but alive, and > > calling net_devmem_dmabuf_binding_put_urefs here is weird. > Autorelease=off implies the user must clean their urefs themselves, > but we have this here in the unbind path, and it doesn't even > guarantee that the urefs are free at this point because it may race > with a recvmsg. > > Should we delete this uref cleanup here, and enforce that > autorelease=off means that the user cleans up the references (the > kernel never cleans them up on unbind or socket close)? The dontneed > path needs to work whether the binding is active or unbound. > I agree, I think we can do away with the "unbind drops references" idea. 
A counter argument could be that it introduces the ability for one process to interfere with another, but in fact that is already possible with autorelease=on by not issuing dontneed and starving the other of tokens. > > net_devmem_dmabuf_binding_put(binding); > > } > > > > @@ -179,8 +220,10 @@ struct net_devmem_dmabuf_binding * > > net_devmem_bind_dmabuf(struct net_device *dev, > > struct device *dma_dev, > > enum dma_data_direction direction, > > - unsigned int dmabuf_fd, struct netdev_nl_sock *priv, > > - struct netlink_ext_ack *extack) > > + unsigned int dmabuf_fd, > > + struct netdev_nl_sock *priv, > > + struct netlink_ext_ack *extack, > > + bool autorelease) > > { > > struct net_devmem_dmabuf_binding *binding; > > static u32 id_alloc_next; > > @@ -231,14 +274,12 @@ net_devmem_bind_dmabuf(struct net_device *dev, > > goto err_detach; > > } > > > > - if (direction == DMA_TO_DEVICE) { > > - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE, > > - sizeof(struct net_iov *), > > - GFP_KERNEL); > > - if (!binding->vec) { > > - err = -ENOMEM; > > - goto err_unmap; > > - } > > + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE, > > + sizeof(struct net_iov *), > > + GFP_KERNEL | __GFP_ZERO); > > + if (!binding->vec) { > > + err = -ENOMEM; > > + goto err_unmap; > > } > > > > /* For simplicity we expect to make PAGE_SIZE allocations, but the > > @@ -292,25 +333,62 @@ net_devmem_bind_dmabuf(struct net_device *dev, > > niov = &owner->area.niovs[i]; > > niov->type = NET_IOV_DMABUF; > > niov->owner = &owner->area; > > + atomic_set(&niov->uref, 0); > > page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov), > > net_devmem_get_dma_addr(niov)); > > - if (direction == DMA_TO_DEVICE) > > - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov; > > + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov; > > } > > > > virtual += len; > > } > > > > + mutex_lock(&devmem_ar_lock); > > + > > + if (direction == DMA_FROM_DEVICE) { > > + if 
(net_devmem_dmabuf_rx_bindings_count > 0) { > > + bool mode; > > + > > + mode = static_key_enabled(&tcp_devmem_ar_key); > > + > > + /* When bindings exist, enforce that the mode does not > > + * change. > > + */ > > + if (mode != autorelease) { > > + NL_SET_ERR_MSG_FMT(extack, > > + "System already configured with autorelease=%d", > > + mode); > > + err = -EINVAL; > > + goto err_unlock_mutex; > > + } > > + } else if (autorelease) { > > + /* First binding with autorelease enabled sets the > > + * mode. If autorelease is false, the key is already > > + * disabled by default so no action is needed. > > + */ > > + static_branch_enable(&tcp_devmem_ar_key); > > + } > > + > > + net_devmem_dmabuf_rx_bindings_count++; > > + } > > + > > err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id, > > binding, xa_limit_32b, &id_alloc_next, > > GFP_KERNEL); > > if (err < 0) > > - goto err_free_chunks; > > + goto err_dec_binding_count; > > + > > + mutex_unlock(&devmem_ar_lock); > > > > list_add(&binding->list, &priv->bindings); > > > > return binding; > > > > +err_dec_binding_count: > > + if (direction == DMA_FROM_DEVICE) > > + net_devmem_dmabuf_rx_bindings_count--; > > + > > +err_unlock_mutex: > > + mutex_unlock(&devmem_ar_lock); > > err_free_chunks: > > gen_pool_for_each_chunk(binding->chunk_pool, > > net_devmem_dmabuf_free_chunk_owner, NULL); > > diff --git a/net/core/devmem.h b/net/core/devmem.h > > index 1ea6228e4f40..8c586f30e371 100644 > > --- a/net/core/devmem.h > > +++ b/net/core/devmem.h > > @@ -12,9 +12,13 @@ > > > > #include <net/netmem.h> > > #include <net/netdev_netlink.h> > > +#include <linux/jump_label.h> > > > > struct netlink_ext_ack; > > > > +/* static key for TCP devmem autorelease */ > > +extern struct static_key_false tcp_devmem_ar_key; > > + > > struct net_devmem_dmabuf_binding { > > struct dma_buf *dmabuf; > > struct dma_buf_attachment *attachment; > > @@ -61,7 +65,7 @@ struct net_devmem_dmabuf_binding { > > > > /* Array of net_iov pointers for this 
binding, sorted by virtual > > * address. This array is convenient to map the virtual addresses to > > - * net_iovs in the TX path. > > + * net_iovs. > > */ > > struct net_iov **vec; > > > > @@ -88,7 +92,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, > > struct device *dma_dev, > > enum dma_data_direction direction, > > unsigned int dmabuf_fd, struct netdev_nl_sock *priv, > > - struct netlink_ext_ack *extack); > > + struct netlink_ext_ack *extack, bool autorelease); > > struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id); > > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding); > > int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, > > @@ -138,6 +142,11 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding) > > schedule_work(&binding->unbind_w); > > } > > > > +static inline bool net_devmem_autorelease_enabled(void) > > +{ > > + return static_branch_unlikely(&tcp_devmem_ar_key); > > +} > > + > > void net_devmem_get_net_iov(struct net_iov *niov); > > void net_devmem_put_net_iov(struct net_iov *niov); > > > > @@ -155,6 +164,12 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr, > > #else > > struct net_devmem_dmabuf_binding; > > > > +static inline bool > > +net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding) > > +{ > > + return false; > > +} > > + > > static inline void > > net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding) > > { > > @@ -174,7 +189,8 @@ net_devmem_bind_dmabuf(struct net_device *dev, > > enum dma_data_direction direction, > > unsigned int dmabuf_fd, > > struct netdev_nl_sock *priv, > > - struct netlink_ext_ack *extack) > > + struct netlink_ext_ack *extack, > > + bool autorelease) > > { > > return ERR_PTR(-EOPNOTSUPP); > > } > > @@ -241,6 +257,11 @@ net_devmem_iov_binding(const struct net_iov *niov) > > { > > return NULL; > > } > > + > > +static inline bool net_devmem_autorelease_enabled(void) > > 
+{ > > + return false; > > +} > > #endif > > > > #endif /* _NET_DEVMEM_H */ > > diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c > > index ba673e81716f..01b7765e11ec 100644 > > --- a/net/core/netdev-genl-gen.c > > +++ b/net/core/netdev-genl-gen.c > > @@ -86,10 +86,11 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE > > }; > > > > /* NETDEV_CMD_BIND_RX - do */ > > -static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = { > > +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_AUTORELEASE + 1] = { > > [NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), > > [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, }, > > [NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy), > > + [NETDEV_A_DMABUF_AUTORELEASE] = { .type = NLA_U8, }, > > }; > > > > /* NETDEV_CMD_NAPI_SET - do */ > > @@ -188,7 +189,7 @@ static const struct genl_split_ops netdev_nl_ops[] = { > > .cmd = NETDEV_CMD_BIND_RX, > > .doit = netdev_nl_bind_rx_doit, > > .policy = netdev_bind_rx_nl_policy, > > - .maxattr = NETDEV_A_DMABUF_FD, > > + .maxattr = NETDEV_A_DMABUF_AUTORELEASE, > > .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO, > > }, > > { > > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c > > index 470fabbeacd9..c742bb34865e 100644 > > --- a/net/core/netdev-genl.c > > +++ b/net/core/netdev-genl.c > > @@ -939,6 +939,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) > > struct netdev_nl_sock *priv; > > struct net_device *netdev; > > unsigned long *rxq_bitmap; > > + bool autorelease = false; > > struct device *dma_dev; > > struct sk_buff *rsp; > > int err = 0; > > @@ -952,6 +953,10 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) > > ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]); > > dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]); > > > > + if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE]) > > + autorelease = > > + 
!!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]); > > + > > priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk); > > if (IS_ERR(priv)) > > return PTR_ERR(priv); > > @@ -1002,7 +1007,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) > > } > > > > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE, > > - dmabuf_fd, priv, info->extack); > > + dmabuf_fd, priv, info->extack, > > + autorelease); > > if (IS_ERR(binding)) { > > err = PTR_ERR(binding); > > goto err_rxq_bitmap; > > @@ -1097,7 +1103,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info) > > > > dma_dev = netdev_queue_get_dma_dev(netdev, 0); > > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE, > > - dmabuf_fd, priv, info->extack); > > + dmabuf_fd, priv, info->extack, false); > > if (IS_ERR(binding)) { > > err = PTR_ERR(binding); > > goto err_unlock_netdev; > > diff --git a/net/core/sock.c b/net/core/sock.c > > index f6526f43aa6e..6355c2ccfb8a 100644 > > --- a/net/core/sock.c > > +++ b/net/core/sock.c > > @@ -87,6 +87,7 @@ > > > > #include <linux/unaligned.h> > > #include <linux/capability.h> > > +#include <linux/dma-buf.h> > > #include <linux/errno.h> > > #include <linux/errqueue.h> > > #include <linux/types.h> > > @@ -151,6 +152,7 @@ > > #include <uapi/linux/pidfd.h> > > > > #include "dev.h" > > +#include "devmem.h" > > > > static DEFINE_MUTEX(proto_list_mutex); > > static LIST_HEAD(proto_list); > > @@ -1081,6 +1083,44 @@ static int sock_reserve_memory(struct sock *sk, int bytes) > > #define MAX_DONTNEED_TOKENS 128 > > #define MAX_DONTNEED_FRAGS 1024 > > > > +static noinline_for_stack int > > +sock_devmem_dontneed_manual_release(struct sock *sk, > > + struct dmabuf_token *tokens, > > + unsigned int num_tokens) > > +{ > > + struct net_iov *niov; > > + unsigned int i, j; > > + netmem_ref netmem; > > + unsigned int token; > > + int num_frags = 0; > > + int ret = 0; > > + > > + if (!sk->sk_devmem_info.binding) > > + 
return -EINVAL; > > + > > + for (i = 0; i < num_tokens; i++) { > > + for (j = 0; j < tokens[i].token_count; j++) { > > + size_t size = sk->sk_devmem_info.binding->dmabuf->size; > > + > > + token = tokens[i].token_start + j; > > + if (token >= size / PAGE_SIZE) > > + break; > > + > > + if (++num_frags > MAX_DONTNEED_FRAGS) > > + return ret; > > + > > + niov = sk->sk_devmem_info.binding->vec[token]; > > + if (atomic_dec_and_test(&niov->uref)) { > > + netmem = net_iov_to_netmem(niov); > > + WARN_ON_ONCE(!napi_pp_put_page(netmem)); > > + } > > + ret++; > > + } > > + } > > + > > + return ret; > > +} > > + > > static noinline_for_stack int > > sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens, > > unsigned int num_tokens) > > @@ -1089,32 +1129,33 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens, > > int ret = 0, num_frags = 0; > > netmem_ref netmems[16]; > > > > - xa_lock_bh(&sk->sk_user_frags); > > + xa_lock_bh(&sk->sk_devmem_info.frags); > > for (i = 0; i < num_tokens; i++) { > > for (j = 0; j < tokens[i].token_count; j++) { > > if (++num_frags > MAX_DONTNEED_FRAGS) > > goto frag_limit_reached; > > > > netmem_ref netmem = (__force netmem_ref)__xa_erase( > > - &sk->sk_user_frags, tokens[i].token_start + j); > > + &sk->sk_devmem_info.frags, > > + tokens[i].token_start + j); > > > > if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem))) > > continue; > > > > netmems[netmem_num++] = netmem; > > if (netmem_num == ARRAY_SIZE(netmems)) { > > - xa_unlock_bh(&sk->sk_user_frags); > > + xa_unlock_bh(&sk->sk_devmem_info.frags); > > for (k = 0; k < netmem_num; k++) > > WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); > > netmem_num = 0; > > - xa_lock_bh(&sk->sk_user_frags); > > + xa_lock_bh(&sk->sk_devmem_info.frags); > > } > > ret++; > > } > > } > > > > frag_limit_reached: > > - xa_unlock_bh(&sk->sk_user_frags); > > + xa_unlock_bh(&sk->sk_devmem_info.frags); > > for (k = 0; k < netmem_num; k++) > > 
WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); > > > > @@ -1145,7 +1186,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) > > return -EFAULT; > > } > > > > - ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens); > > + if (net_devmem_autorelease_enabled()) > > + ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens); > > + else > > + ret = sock_devmem_dontneed_manual_release(sk, tokens, > > + num_tokens); > > > > kvfree(tokens); > > return ret; > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index d5319ebe2452..a8a4af552909 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -260,6 +260,7 @@ > > #include <linux/memblock.h> > > #include <linux/highmem.h> > > #include <linux/cache.h> > > +#include <linux/dma-buf.h> > > #include <linux/err.h> > > #include <linux/time.h> > > #include <linux/slab.h> > > @@ -492,7 +493,8 @@ void tcp_init_sock(struct sock *sk) > > > > set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); > > sk_sockets_allocated_inc(sk); > > - xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1); > > + xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1); > > + sk->sk_devmem_info.binding = NULL; > > } > > EXPORT_IPV6_MOD(tcp_init_sock); > > > > @@ -2424,11 +2426,12 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p) > > > > /* Commit part that has been copied to user space. */ > > for (i = 0; i < p->idx; i++) > > - __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY, > > - (__force void *)p->netmems[i], GFP_KERNEL); > > + __xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i], > > + XA_ZERO_ENTRY, (__force void *)p->netmems[i], > > + GFP_KERNEL); > > /* Rollback what has been pre-allocated and is no longer needed. 
*/ > > for (; i < p->max; i++) > > - __xa_erase(&sk->sk_user_frags, p->tokens[i]); > > + __xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]); > > > > p->max = 0; > > p->idx = 0; > > @@ -2436,14 +2439,18 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p) > > > > static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p) > > { > > + /* Skip xarray operations if autorelease is disabled (manual mode) */ > > + if (!net_devmem_autorelease_enabled()) > > + return; > > + > > if (!p->max) > > return; > > > > - xa_lock_bh(&sk->sk_user_frags); > > + xa_lock_bh(&sk->sk_devmem_info.frags); > > > > tcp_xa_pool_commit_locked(sk, p); > > > > - xa_unlock_bh(&sk->sk_user_frags); > > + xa_unlock_bh(&sk->sk_devmem_info.frags); > > } > > > > static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, > > @@ -2454,24 +2461,41 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, > > if (p->idx < p->max) > > return 0; > > > > - xa_lock_bh(&sk->sk_user_frags); > > + xa_lock_bh(&sk->sk_devmem_info.frags); > > > > tcp_xa_pool_commit_locked(sk, p); > > > > for (k = 0; k < max_frags; k++) { > > - err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k], > > + err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k], > > XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL); > > if (err) > > break; > > } > > > > - xa_unlock_bh(&sk->sk_user_frags); > > + xa_unlock_bh(&sk->sk_devmem_info.frags); > > > > p->max = k; > > p->idx = 0; > > return k ? 
0 : err; > > } > > > > +static void tcp_xa_pool_inc_pp_ref_count(struct tcp_xa_pool *tcp_xa_pool, > > + skb_frag_t *frag) > > +{ > > + struct net_iov *niov; > > + > > + niov = skb_frag_net_iov(frag); > > + > > + if (net_devmem_autorelease_enabled()) { > > + atomic_long_inc(&niov->desc.pp_ref_count); > > + tcp_xa_pool->netmems[tcp_xa_pool->idx++] = > > + skb_frag_netmem(frag); > > + } else { > > + if (atomic_inc_return(&niov->uref) == 1) > > + atomic_long_inc(&niov->desc.pp_ref_count); > > + } > > +} > > + > > /* On error, returns the -errno. On success, returns number of bytes sent to the > > * user. May not consume all of @remaining_len. > > */ > > @@ -2533,6 +2557,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, > > * sequence of cmsg > > */ > > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { > > + struct net_devmem_dmabuf_binding *binding = NULL; > > skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; > > struct net_iov *niov; > > u64 frag_offset; > > @@ -2568,13 +2593,35 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, > > start; > > dmabuf_cmsg.frag_offset = frag_offset; > > dmabuf_cmsg.frag_size = copy; > > - err = tcp_xa_pool_refill(sk, &tcp_xa_pool, > > - skb_shinfo(skb)->nr_frags - i); > > - if (err) > > + > > + binding = net_devmem_iov_binding(niov); > > + > > + if (!sk->sk_devmem_info.binding) { > > + net_devmem_dmabuf_binding_get(binding); > > + sk->sk_devmem_info.binding = binding; > > + } > > + > > + if (sk->sk_devmem_info.binding != binding) { > > + err = -EFAULT; > > goto out; > > + } > > + > > + if (net_devmem_autorelease_enabled()) { > > + err = tcp_xa_pool_refill(sk, > > + &tcp_xa_pool, > > + skb_shinfo(skb)->nr_frags - i); > > + if (err) > > + goto out; > > + > > + dmabuf_cmsg.frag_token = > > + tcp_xa_pool.tokens[tcp_xa_pool.idx]; > > + } else { > > + dmabuf_cmsg.frag_token = > > + net_iov_virtual_addr(niov) >> PAGE_SHIFT; > > + } > > + > > > > /* Will perform the exchange later */ > > - 
dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx]; > > dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov); > > > > offset += copy; > > @@ -2587,8 +2634,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, > > if (err) > > goto out; > > > > - atomic_long_inc(&niov->desc.pp_ref_count); > > - tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag); > > + tcp_xa_pool_inc_pp_ref_count(&tcp_xa_pool, frag); > > > > sent += copy; > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > > index f8a9596e8f4d..7b1b5a17002f 100644 > > --- a/net/ipv4/tcp_ipv4.c > > +++ b/net/ipv4/tcp_ipv4.c > > @@ -89,6 +89,9 @@ > > > > #include <crypto/md5.h> > > > > +#include <linux/dma-buf.h> > > +#include "../core/devmem.h" > > + > > #include <trace/events/tcp.h> > > > > #ifdef CONFIG_TCP_MD5SIG > > @@ -2492,7 +2495,7 @@ static void tcp_release_user_frags(struct sock *sk) > > unsigned long index; > > void *netmem; > > > > - xa_for_each(&sk->sk_user_frags, index, netmem) > > + xa_for_each(&sk->sk_devmem_info.frags, index, netmem) > > WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem)); > > #endif > > } > > @@ -2503,7 +2506,11 @@ void tcp_v4_destroy_sock(struct sock *sk) > > > > tcp_release_user_frags(sk); > > > > - xa_destroy(&sk->sk_user_frags); > > + xa_destroy(&sk->sk_devmem_info.frags); > > + if (sk->sk_devmem_info.binding) { > > + net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding); > > I don't understand the refcounting relationship between the binding > and the socket when autorelease=off. It seems you're grabbing a > reference on the binding on every recvmsg(), but only dropping the > reference on the binding once here, so the binding is never freed. We grab a ref only the one time that sk_devmem_info.binding is set, and then we release it before setting sk_devmem_info.binding back to NULL. This is only done to avoid use-after-free on this pointer in the situation: 1. socket releases all tokens 2. 
dmabuf is unbound (ref count reaches zero, it is freed) 3. socket sends some bad token to dontneed which dereferences use-after-free sk_devmem_info.binding The other option was bindings having a list of sockets that use it or something like that... but I think this is a good use case for the ref count. > > > + sk->sk_devmem_info.binding = NULL; > > + } > > > > trace_tcp_destroy_sock(sk); > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > > index bd5462154f97..2aec977f5c12 100644 > > --- a/net/ipv4/tcp_minisocks.c > > +++ b/net/ipv4/tcp_minisocks.c > > @@ -662,7 +662,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, > > > > __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS); > > > > - xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1); > > + xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1); > > + newsk->sk_devmem_info.binding = NULL; > > > > return newsk; > > } > > diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h > > index e0b579a1df4f..1e5c209cb998 100644 > > --- a/tools/include/uapi/linux/netdev.h > > +++ b/tools/include/uapi/linux/netdev.h > > @@ -207,6 +207,7 @@ enum { > > NETDEV_A_DMABUF_QUEUES, > > NETDEV_A_DMABUF_FD, > > NETDEV_A_DMABUF_ID, > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > __NETDEV_A_DMABUF_MAX, > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > > > -- > > 2.47.3 > > > > > -- > Thanks, > Mina Thanks for the review Mina! Best, Bobby ^ permalink raw reply [flat|nested] 21+ messages in thread
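The uref accounting discussed above (tcp_xa_pool_inc_pp_ref_count plus the SO_DEVMEM_DONTNEED and unbind paths) can be modeled in userspace with two counters. This is a hedged sketch with hypothetical names (fake_niov, niov_uref_get/niov_uref_put/niov_put_urefs), not the kernel API; it only illustrates the invariant from the patch: many urefs on one niov pin a single page-pool reference, taken on the 0->1 uref transition and dropped on the 1->0 transition, or in bulk when leaked urefs are flushed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal model of a net_iov: uref counts user-held tokens, pp_ref
 * models the single underlying page-pool reference that all urefs on
 * this niov share. Hypothetical names, not the kernel structures. */
struct fake_niov {
	atomic_long pp_ref;	/* models niov->desc.pp_ref_count */
	atomic_int uref;	/* models niov->uref */
};

/* recvmsg path (autorelease=off): only the first uref takes a pp ref
 * (kernel analog: atomic_inc_return(&niov->uref) == 1). */
static void niov_uref_get(struct fake_niov *niov)
{
	if (atomic_fetch_add(&niov->uref, 1) == 0)
		atomic_fetch_add(&niov->pp_ref, 1);
}

/* SO_DEVMEM_DONTNEED path: dropping the last uref drops the pp ref. */
static void niov_uref_put(struct fake_niov *niov)
{
	if (atomic_fetch_sub(&niov->uref, 1) == 1)
		atomic_fetch_sub(&niov->pp_ref, 1);
}

/* Bulk flush of leaked urefs, mirroring the atomic_xchg() in
 * net_devmem_dmabuf_binding_put_urefs(). */
static void niov_put_urefs(struct fake_niov *niov)
{
	if (atomic_exchange(&niov->uref, 0) > 0)
		atomic_fetch_sub(&niov->pp_ref, 1);
}
```

The same shape explains why the destroy path only needs a single put per niov no matter how many tokens the user leaked.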
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-12 16:24 ` Bobby Eshleman @ 2026-01-13 19:27 ` Mina Almasry 2026-01-13 20:32 ` Bobby Eshleman 2026-01-14 20:54 ` Stanislav Fomichev 1 sibling, 1 reply; 21+ messages in thread From: Mina Almasry @ 2026-01-13 19:27 UTC (permalink / raw) To: Bobby Eshleman Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Mon, Jan 12, 2026 at 8:24 AM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > On Sun, Jan 11, 2026 at 11:12:19AM -0800, Mina Almasry wrote: > > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > Add support for autorelease toggling of tokens using a static branch to > > > control system-wide behavior. This allows applications to choose between > > > two memory management modes: > > > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > > socket closes. > > > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > > binding is disallowed and is rejected by netlink. The system will be > > > "locked" into the mode that the first binding is set to. It can only be > > > changed again once there are zero bindings on the system. > > > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > > > Static branching is used to limit the system to one mode or the other. 
> > > > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > > --- > > > Changes in v9: > > > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > > > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > > > stubbed out when NET_DEVMEM=n > > > - only dec rx binding count for rx bindings in free (v8 did not exclude > > > TX bindings) > > > > > > Changes in v8: > > > - Only reset static key when bindings go to zero, defaulting back to > > > disabled (Stan). > > > - Fix bad usage of xarray spinlock for sleepy static branch switching, > > > use mutex instead. > > > - Access pp_ref_count via niov->desc instead of niov directly. > > > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > > > the static key can not be changed while there are outstanding tokens > > > (free is only called when reference count reaches zero). > > > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > > > even after xa_erase(), so static key changes must wait until all > > > RX bindings are finally freed (not just when xarray is empty). A > > > counter is a simple way to track this. > > > - socket takes reference on the binding, to avoid use-after-free on > > > sk_devmem_info.binding in the case that user releases all tokens, > > > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> > > - removed some comments that were unnecessary > > > > > > Changes in v7: > > > - implement autorelease with static branch (Stan) > > > - use netlink instead of sockopt (Stan) > > > - merge uAPI and implementation patches into one patch (seemed less > > > confusing) > > > > > > Changes in v6: > > > - remove sk_devmem_info.autorelease, using binding->autorelease instead > > > - move binding->autorelease check to outside of > > > net_devmem_dmabuf_binding_put_urefs() (Mina) > > > - remove overly defensive net_is_devmem_iov() (Mina) > > > - add comment about multiple urefs mapping to a single netmem ref (Mina) > > > - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina) > > > - use niov without casting back and forth with netmem (Mina) > > > - move the autorelease flag from per-binding to per-socket (Mina) > > > - remove the batching logic in sock_devmem_dontneed_manual_release() > > > (Mina) > > > - move autorelease check inside tcp_xa_pool_commit() (Mina) > > > - remove single-binding restriction for autorelease mode (Mina) > > > - unbind always checks for leaked urefs > > > > > > Changes in v5: > > > - remove unused variables > > > - introduce autorelease flag, preparing for future patch toggle new > > > behavior > > > > > > Changes in v3: > > > - make urefs per-binding instead of per-socket, reducing memory > > > footprint > > > - fallback to cleaning up references in dmabuf unbind if socket leaked > > > tokens > > > - drop ethtool patch > > > > > > Changes in v2: > > > - always use GFP_ZERO for binding->vec (Mina) > > > - remove WARN for changed binding (Mina) > > > - remove extraneous binding ref get (Mina) > > > - remove WARNs on invalid user input (Mina) > > > - pre-assign niovs in binding->vec for RX case (Mina) > > > - use atomic_set(, 0) to initialize sk_user_frags.urefs > > > - fix length of alloc for urefs > > > --- > > > Documentation/netlink/specs/netdev.yaml | 12 ++++ > > > include/net/netmem.h | 1 + > > > include/net/sock.h | 7 
++- > > > include/uapi/linux/netdev.h | 1 + > > > net/core/devmem.c | 104 ++++++++++++++++++++++++++++---- > > > net/core/devmem.h | 27 ++++++++- > > > net/core/netdev-genl-gen.c | 5 +- > > > net/core/netdev-genl.c | 10 ++- > > > net/core/sock.c | 57 +++++++++++++++-- > > > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > > > net/ipv4/tcp_ipv4.c | 11 +++- > > > net/ipv4/tcp_minisocks.c | 3 +- > > > tools/include/uapi/linux/netdev.h | 1 + > > > 13 files changed, 269 insertions(+), 46 deletions(-) > > > > > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > > > index 596c306ce52b..7cbe9e7b9ee5 100644 > > > --- a/Documentation/netlink/specs/netdev.yaml > > > +++ b/Documentation/netlink/specs/netdev.yaml > > > @@ -562,6 +562,17 @@ attribute-sets: > > > type: u32 > > > checks: > > > min: 1 > > > + - > > > + name: autorelease > > > + doc: | > > > + Token autorelease mode. If true (1), leaked tokens are automatically > > > + released when the socket closes. If false (0), leaked tokens are only > > > + released when the dmabuf is unbound. Once a binding is created with a > > > + specific mode, all subsequent bindings system-wide must use the same > > > + mode. > > > + > > > + Optional. Defaults to false if not specified. 
> > > + type: u8 > > > > > > operations: > > > list: > > > @@ -769,6 +780,7 @@ operations: > > > - ifindex > > > - fd > > > - queues > > > + - autorelease > > > reply: > > > attributes: > > > - id > > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > > index 9e10f4ac50c3..80d2263ba4ed 100644 > > > --- a/include/net/netmem.h > > > +++ b/include/net/netmem.h > > > @@ -112,6 +112,7 @@ struct net_iov { > > > }; > > > struct net_iov_area *owner; > > > enum net_iov_type type; > > > + atomic_t uref; > > > }; > > > > > > struct net_iov_area { > > > diff --git a/include/net/sock.h b/include/net/sock.h > > > index aafe8bdb2c0f..9d3d5bde15e9 100644 > > > --- a/include/net/sock.h > > > +++ b/include/net/sock.h > > > @@ -352,7 +352,7 @@ struct sk_filter; > > > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > > > * @sk_scm_unused: unused flags for scm_recv() > > > * @ns_tracker: tracker for netns reference > > > - * @sk_user_frags: xarray of pages the user is holding a reference on. > > > + * @sk_devmem_info: the devmem binding information for the socket > > > * @sk_owner: reference to the real owner of the socket that calls > > > * sock_lock_init_class_and_name(). 
> > > */ > > > @@ -584,7 +584,10 @@ struct sock { > > > struct numa_drop_counters *sk_drop_counters; > > > struct rcu_head sk_rcu; > > > netns_tracker ns_tracker; > > > - struct xarray sk_user_frags; > > > + struct { > > > + struct xarray frags; > > > + struct net_devmem_dmabuf_binding *binding; > > > + } sk_devmem_info; > > > > > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > > > struct module *sk_owner; > > > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > > > index e0b579a1df4f..1e5c209cb998 100644 > > > --- a/include/uapi/linux/netdev.h > > > +++ b/include/uapi/linux/netdev.h > > > @@ -207,6 +207,7 @@ enum { > > > NETDEV_A_DMABUF_QUEUES, > > > NETDEV_A_DMABUF_FD, > > > NETDEV_A_DMABUF_ID, > > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > > > __NETDEV_A_DMABUF_MAX, > > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > > index 05a9a9e7abb9..05c16df657c7 100644 > > > --- a/net/core/devmem.c > > > +++ b/net/core/devmem.c > > > @@ -11,6 +11,7 @@ > > > #include <linux/genalloc.h> > > > #include <linux/mm.h> > > > #include <linux/netdevice.h> > > > +#include <linux/skbuff_ref.h> > > > #include <linux/types.h> > > > #include <net/netdev_queues.h> > > > #include <net/netdev_rx_queue.h> > > > @@ -28,6 +29,19 @@ > > > > > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > > > > > +/* If the user unbinds before releasing all tokens, the static key must not > > > + * change until all tokens have been released (to avoid calling the wrong > > > + * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes > > > + * and binding alloc/free atomic with regards to each other, using the > > > + * devmem_ar_lock. This works because binding free does not occur until all of > > > + * the outstanding token's references on the binding are dropped. 
> > > + */
> > > +static DEFINE_MUTEX(devmem_ar_lock);
> > > +
> > > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> > > +EXPORT_SYMBOL(tcp_devmem_ar_key);
> > > +static int net_devmem_dmabuf_rx_bindings_count;
> > > +
> > > static const struct memory_provider_ops dmabuf_devmem_ops;
> > >
> > > bool net_is_devmem_iov(struct net_iov *niov)
> > > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
> > >
> > > size_t size, avail;
> > >
> > > + if (binding->direction == DMA_FROM_DEVICE) {
> > > + mutex_lock(&devmem_ar_lock);
> > > + net_devmem_dmabuf_rx_bindings_count--;
> > > + if (net_devmem_dmabuf_rx_bindings_count == 0)
> > > + static_branch_disable(&tcp_devmem_ar_key);
> > > + mutex_unlock(&devmem_ar_lock);
> > > + }
> > > +
> >
> > I find this logic with devmem_ar_lock and
> > net_devmem_dmabuf_rx_bindings_count a bit complicated. I wonder if we
> > can do another simplification here? Can we have it such that the first
> > binding sets the system in autorelease on or autorelease off mode, and
> > all future bindings maintain this state? We already don't support
> > autorelease on/off mix.
>
> I think that would greatly simplify things. We would still need a lock
> to make the static branch change and first release mode setting atomic WRT
> each other, but the other parts (like the one above) can be
> removed.
> > > > > > > > gen_pool_for_each_chunk(binding->chunk_pool, > > > net_devmem_dmabuf_free_chunk_owner, NULL); > > > > > > @@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov) > > > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); > > > } > > > > > > +static void > > > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding) > > > +{ > > > + int i; > > > + > > > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) { > > > + struct net_iov *niov; > > > + netmem_ref netmem; > > > + > > > + niov = binding->vec[i]; > > > + netmem = net_iov_to_netmem(niov); > > > + > > > + /* Multiple urefs map to only a single netmem ref. */ > > > + if (atomic_xchg(&niov->uref, 0) > 0) > > > + WARN_ON_ONCE(!napi_pp_put_page(netmem)); > > > + } > > > +} > > > + > > > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > { > > > struct netdev_rx_queue *rxq; > > > @@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); > > > } > > > > > > + net_devmem_dmabuf_binding_put_urefs(binding); > > > > Sigh, I think what you're trying to do here is very complicated. You > > need to think about this scenario: > > > > 1. user binds dmabuf and opens a autorelease=off socket. > > 2. Data arrives on these sockets, and sits in the receive queues, > > recvmsg has not been called yet by the user. > > 3. User unbinds the dma-buff, netmems are still in the receive queues. > > 4. User calls recvmsg on one of these sockets, which obtains a uref on > > the netmems in the receive queues. > > 5. user closes the socket. 
> >
> > With autorelease=on, this works, because the binding remains alive
> > until step 5 (even though it's unbound from the queue,
> > ..._binding_free has not been called yet) and step 5 cleans up all
> > references, even if the binding is unbound but alive, and
> >
> > calling net_devmem_dmabuf_binding_put_urefs here is weird.
> > Autorelease=off implies the user must clean their urefs themselves,
> > but we have this here in the unbind path, and it doesn't even
> > guarantee that the urefs are free at this point because it may race
> > with a recvmsg.
> >
> > Should we delete this uref cleanup here, and enforce that
> > autorelease=off means that the user cleans up the references (the
> > kernel never cleans them up on unbind or socket close)? The dontneed
> > path needs to work whether the binding is active or unbound.
> >
>
> I agree, I think we can do away with the "unbind drops references" idea.
> A counter argument could be that it introduces the ability for one
> process to interfere with another, but in fact that is already possible
> with autorelease=on by not issuing dontneed and starving the other of
> tokens.
>

On second thought I don't think we can remove the references drop
completely. AFAIU if the userspace misbehaves and doesn't dontneed the
netmems in this setup, then the binding will leak forever, which is
really not great.

I think what may work is having a refcount on the binding for each
rxqueue it's bound to and each socket that's using it. Once that
refcount drops to 0, then we can be sure that the urefs in the binding
are not in use anymore, and we can drop the urefs, which should make
the binding refcount hit 0 and the _binding_free() function to be
called.

--
Thanks,
Mina
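The first-binding-locks-the-mode simplification discussed in this exchange can be sketched in userspace as follows. Hypothetical names throughout (bind_rx, free_rx_binding; a plain bool stands in for the tcp_devmem_ar_key static branch), and the sketch is single-threaded where the kernel serializes both fields with devmem_ar_lock — it is an illustration of the scheme, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: the first RX binding picks the system-wide autorelease
 * mode, later bindings must match it, and the mode reverts to the
 * default (disabled) once the last RX binding is finally freed. */
static int rx_bindings_count;
static bool autorelease_enabled;	/* stands in for the static key */

/* Returns 0 on success, -1 if the requested mode conflicts. */
static int bind_rx(bool autorelease)
{
	if (rx_bindings_count == 0)
		autorelease_enabled = autorelease;	/* first binding locks the mode */
	else if (autorelease_enabled != autorelease)
		return -1;				/* mixed modes are rejected */
	rx_bindings_count++;
	return 0;
}

/* Mirrors __net_devmem_dmabuf_binding_free(): runs only after all token
 * references on the binding are gone, so flipping the mode here cannot
 * change the DONTNEED handler under an outstanding token. */
static void free_rx_binding(void)
{
	if (--rx_bindings_count == 0)
		autorelease_enabled = false;	/* revert to default */
}
```

The key property, which the counter exists to preserve, is that the mode can only change while the binding count is zero.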
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-13 19:27 ` Mina Almasry @ 2026-01-13 20:32 ` Bobby Eshleman 2026-01-13 20:42 ` Mina Almasry 0 siblings, 1 reply; 21+ messages in thread From: Bobby Eshleman @ 2026-01-13 20:32 UTC (permalink / raw) To: Mina Almasry Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Tue, Jan 13, 2026 at 11:27:38AM -0800, Mina Almasry wrote: > On Mon, Jan 12, 2026 at 8:24 AM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > On Sun, Jan 11, 2026 at 11:12:19AM -0800, Mina Almasry wrote: > > > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > > > Add support for autorelease toggling of tokens using a static branch to > > > > control system-wide behavior. This allows applications to choose between > > > > two memory management modes: > > > > > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > > > socket closes. > > > > > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > > > binding is disallowed and is rejected by netlink. The system will be > > > > "locked" into the mode that the first binding is set to. It can only be > > > > changed again once there are zero bindings on the system. > > > > > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > > > > > Static branching is used to limit the system to one mode or the other. 
[ ... quoted changelog, patch, and earlier discussion snipped ... ]

>
> On second thought I don't think we can remove the references drop
> completely. AFAIU if the userspace misbehaves and doesn't dontneed the
> netmems in this setup, then the binding will leak forever, which is
> really not great.
>
> I think what may work is having a refcount on the binding for each
> rxqueue it's bound to and each socket that's using it. Once that
> refcount drops to 0, then we can be sure that the urefs in the binding
> are not in use anymore, and we can drop the urefs, which should make
> the binding refcount hit 0 and the _binding_free() function to be
> called.
>

That checks out. I guess a reasonable name to differentiate with
binding->ref might be binding->users?

Best,
Bobby
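The per-user count Mina proposes (which Bobby suggests naming binding->users, distinct from the existing binding refcount) could look roughly like the sketch below. All names are hypothetical and this is not the eventual implementation: each bound rx queue and each socket holds one "user" reference, and dropping the last user flushes any leaked urefs, letting the binding refcount reach zero even if userspace never issues SO_DEVMEM_DONTNEED.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the proposed binding->users scheme: 'ref' keeps the binding
 * alive, 'users' counts rx queues and sockets that may still mint
 * urefs. Once users hits 0, no new urefs can appear, so it is safe to
 * flush leaked urefs and let the binding be freed. */
struct fake_binding {
	int ref;		/* models the binding refcount */
	int users;		/* rx queues + sockets using the binding */
	int leaked_urefs;	/* urefs never returned via DONTNEED */
	bool freed;
};

static void binding_put(struct fake_binding *b)
{
	if (--b->ref == 0)
		b->freed = true;	/* __net_devmem_dmabuf_binding_free() */
}

static void put_urefs(struct fake_binding *b)
{
	/* in this toy model, each leaked uref pins one binding ref */
	while (b->leaked_urefs > 0) {
		b->leaked_urefs--;
		binding_put(b);
	}
}

static void binding_user_put(struct fake_binding *b)
{
	if (--b->users == 0)
		put_urefs(b);	/* no one can mint urefs anymore */
	binding_put(b);		/* each user also held a plain ref */
}
```

Under this scheme a misbehaving application can delay the cleanup, but cannot leak the binding forever: closing its sockets and unbinding the queues is enough to release everything.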
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-13 20:32 ` Bobby Eshleman @ 2026-01-13 20:42 ` Mina Almasry 2026-01-14 3:18 ` Bobby Eshleman 2026-01-14 17:04 ` Bobby Eshleman 0 siblings, 2 replies; 21+ messages in thread From: Mina Almasry @ 2026-01-13 20:42 UTC (permalink / raw) To: Bobby Eshleman Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Tue, Jan 13, 2026 at 12:32 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > On Tue, Jan 13, 2026 at 11:27:38AM -0800, Mina Almasry wrote: > > On Mon, Jan 12, 2026 at 8:24 AM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > On Sun, Jan 11, 2026 at 11:12:19AM -0800, Mina Almasry wrote: > > > > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > > > > > Add support for autorelease toggling of tokens using a static branch to > > > > > control system-wide behavior. This allows applications to choose between > > > > > two memory management modes: > > > > > > > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > > > > socket closes. > > > > > > > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > > > > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > > > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > > > > binding is disallowed and is rejected by netlink. The system will be > > > > > "locked" into the mode that the first binding is set to. It can only be > > > > > changed again once there are zero bindings on the system. 
> > > > > > > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > > > > > > > Static branching is used to limit the system to one mode or the other. > > > > > > > > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > --- > > > > > Changes in v9: > > > > > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > > > > > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > > > > > stubbed out when NET_DEVMEM=n > > > > > - only dec rx binding count for rx bindings in free (v8 did not exclude > > > > > TX bindings) > > > > > > > > > > Changes in v8: > > > > > - Only reset static key when bindings go to zero, defaulting back to > > > > > disabled (Stan). > > > > > - Fix bad usage of xarray spinlock for sleepy static branch switching, > > > > > use mutex instead. > > > > > - Access pp_ref_count via niov->desc instead of niov directly. > > > > > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > > > > > the static key can not be changed while there are outstanding tokens > > > > > (free is only called when reference count reaches zero). > > > > > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > > > > > even after xa_erase(), so static key changes must wait until all > > > > > RX bindings are finally freed (not just when xarray is empty). A > > > > > counter is a simple way to track this. > > > > > - socket takes reference on the binding, to avoid use-after-free on > > > > > sk_devmem_info.binding in the case that user releases all tokens, > > > > > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> > > > > - removed some comments that were unnecessary > > > > > > > > > > Changes in v7: > > > > > - implement autorelease with static branch (Stan) > > > > > - use netlink instead of sockopt (Stan) > > > > > - merge uAPI and implementation patches into one patch (seemed less > > > > > confusing) > > > > > > > > > > Changes in v6: > > > > > - remove sk_devmem_info.autorelease, using binding->autorelease instead > > > > > - move binding->autorelease check to outside of > > > > > net_devmem_dmabuf_binding_put_urefs() (Mina) > > > > > - remove overly defensive net_is_devmem_iov() (Mina) > > > > > - add comment about multiple urefs mapping to a single netmem ref (Mina) > > > > > - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina) > > > > > - use niov without casting back and forth with netmem (Mina) > > > > > - move the autorelease flag from per-binding to per-socket (Mina) > > > > > - remove the batching logic in sock_devmem_dontneed_manual_release() > > > > > (Mina) > > > > > - move autorelease check inside tcp_xa_pool_commit() (Mina) > > > > > - remove single-binding restriction for autorelease mode (Mina) > > > > > - unbind always checks for leaked urefs > > > > > > > > > > Changes in v5: > > > > > - remove unused variables > > > > > - introduce autorelease flag, preparing for future patch toggle new > > > > > behavior > > > > > > > > > > Changes in v3: > > > > > - make urefs per-binding instead of per-socket, reducing memory > > > > > footprint > > > > > - fallback to cleaning up references in dmabuf unbind if socket leaked > > > > > tokens > > > > > - drop ethtool patch > > > > > > > > > > Changes in v2: > > > > > - always use GFP_ZERO for binding->vec (Mina) > > > > > - remove WARN for changed binding (Mina) > > > > > - remove extraneous binding ref get (Mina) > > > > > - remove WARNs on invalid user input (Mina) > > > > > - pre-assign niovs in binding->vec for RX case (Mina) > > > > > - use atomic_set(, 0) to initialize sk_user_frags.urefs 
> > > > > - fix length of alloc for urefs > > > > > --- > > > > > Documentation/netlink/specs/netdev.yaml | 12 ++++ > > > > > include/net/netmem.h | 1 + > > > > > include/net/sock.h | 7 ++- > > > > > include/uapi/linux/netdev.h | 1 + > > > > > net/core/devmem.c | 104 ++++++++++++++++++++++++++++---- > > > > > net/core/devmem.h | 27 ++++++++- > > > > > net/core/netdev-genl-gen.c | 5 +- > > > > > net/core/netdev-genl.c | 10 ++- > > > > > net/core/sock.c | 57 +++++++++++++++-- > > > > > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > > > > > net/ipv4/tcp_ipv4.c | 11 +++- > > > > > net/ipv4/tcp_minisocks.c | 3 +- > > > > > tools/include/uapi/linux/netdev.h | 1 + > > > > > 13 files changed, 269 insertions(+), 46 deletions(-) > > > > > > > > > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > > > > > index 596c306ce52b..7cbe9e7b9ee5 100644 > > > > > --- a/Documentation/netlink/specs/netdev.yaml > > > > > +++ b/Documentation/netlink/specs/netdev.yaml > > > > > @@ -562,6 +562,17 @@ attribute-sets: > > > > > type: u32 > > > > > checks: > > > > > min: 1 > > > > > + - > > > > > + name: autorelease > > > > > + doc: | > > > > > + Token autorelease mode. If true (1), leaked tokens are automatically > > > > > + released when the socket closes. If false (0), leaked tokens are only > > > > > + released when the dmabuf is unbound. Once a binding is created with a > > > > > + specific mode, all subsequent bindings system-wide must use the same > > > > > + mode. > > > > > + > > > > > + Optional. Defaults to false if not specified. 
> > > > > + type: u8 > > > > > > > > > > operations: > > > > > list: > > > > > @@ -769,6 +780,7 @@ operations: > > > > > - ifindex > > > > > - fd > > > > > - queues > > > > > + - autorelease > > > > > reply: > > > > > attributes: > > > > > - id > > > > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > > > > index 9e10f4ac50c3..80d2263ba4ed 100644 > > > > > --- a/include/net/netmem.h > > > > > +++ b/include/net/netmem.h > > > > > @@ -112,6 +112,7 @@ struct net_iov { > > > > > }; > > > > > struct net_iov_area *owner; > > > > > enum net_iov_type type; > > > > > + atomic_t uref; > > > > > }; > > > > > > > > > > struct net_iov_area { > > > > > diff --git a/include/net/sock.h b/include/net/sock.h > > > > > index aafe8bdb2c0f..9d3d5bde15e9 100644 > > > > > --- a/include/net/sock.h > > > > > +++ b/include/net/sock.h > > > > > @@ -352,7 +352,7 @@ struct sk_filter; > > > > > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > > > > > * @sk_scm_unused: unused flags for scm_recv() > > > > > * @ns_tracker: tracker for netns reference > > > > > - * @sk_user_frags: xarray of pages the user is holding a reference on. > > > > > + * @sk_devmem_info: the devmem binding information for the socket > > > > > * @sk_owner: reference to the real owner of the socket that calls > > > > > * sock_lock_init_class_and_name(). 
> > > > > */ > > > > > @@ -584,7 +584,10 @@ struct sock { > > > > > struct numa_drop_counters *sk_drop_counters; > > > > > struct rcu_head sk_rcu; > > > > > netns_tracker ns_tracker; > > > > > - struct xarray sk_user_frags; > > > > > + struct { > > > > > + struct xarray frags; > > > > > + struct net_devmem_dmabuf_binding *binding; > > > > > + } sk_devmem_info; > > > > > > > > > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > > > > > struct module *sk_owner; > > > > > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > > > > > index e0b579a1df4f..1e5c209cb998 100644 > > > > > --- a/include/uapi/linux/netdev.h > > > > > +++ b/include/uapi/linux/netdev.h > > > > > @@ -207,6 +207,7 @@ enum { > > > > > NETDEV_A_DMABUF_QUEUES, > > > > > NETDEV_A_DMABUF_FD, > > > > > NETDEV_A_DMABUF_ID, > > > > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > > > > > > > __NETDEV_A_DMABUF_MAX, > > > > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > > > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > > > > index 05a9a9e7abb9..05c16df657c7 100644 > > > > > --- a/net/core/devmem.c > > > > > +++ b/net/core/devmem.c > > > > > @@ -11,6 +11,7 @@ > > > > > #include <linux/genalloc.h> > > > > > #include <linux/mm.h> > > > > > #include <linux/netdevice.h> > > > > > +#include <linux/skbuff_ref.h> > > > > > #include <linux/types.h> > > > > > #include <net/netdev_queues.h> > > > > > #include <net/netdev_rx_queue.h> > > > > > @@ -28,6 +29,19 @@ > > > > > > > > > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > > > > > > > > > +/* If the user unbinds before releasing all tokens, the static key must not > > > > > + * change until all tokens have been released (to avoid calling the wrong > > > > > + * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes > > > > > + * and binding alloc/free atomic with regards to each other, using the > > > > > + * devmem_ar_lock. 
This works because binding free does not occur until all of > > > > > + * the outstanding token's references on the binding are dropped. > > > > > + */ > > > > > +static DEFINE_MUTEX(devmem_ar_lock); > > > > > + > > > > > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key); > > > > > +EXPORT_SYMBOL(tcp_devmem_ar_key); > > > > > +static int net_devmem_dmabuf_rx_bindings_count; > > > > > + > > > > > static const struct memory_provider_ops dmabuf_devmem_ops; > > > > > > > > > > bool net_is_devmem_iov(struct net_iov *niov) > > > > > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) > > > > > > > > > > size_t size, avail; > > > > > > > > > > + if (binding->direction == DMA_FROM_DEVICE) { > > > > > + mutex_lock(&devmem_ar_lock); > > > > > + net_devmem_dmabuf_rx_bindings_count--; > > > > > + if (net_devmem_dmabuf_rx_bindings_count == 0) > > > > > + static_branch_disable(&tcp_devmem_ar_key); > > > > > + mutex_unlock(&devmem_ar_lock); > > > > > + } > > > > > + > > > > I find this logic with > > > > devmem_ar_lock and > > > > net_devmem_dmabuf_rx_bindings_count a bit complicated. I wonder if we > > > > can do another simplification here? Can we have it such that the first > > > > binding sets the system in autorelease on or autorelease off mode, and > > > > all future bindings maintain this state? We already don't support > > > > autorelease on/off mix. > > > I think that would greatly simplify things. We would still need a lock > > > to make the static branch change and first release mode setting atomic WRT > > > each other, but the other parts (like the one above) can be > > > removed.
> > > > > > > > > > > > > > > > gen_pool_for_each_chunk(binding->chunk_pool, > > > > > net_devmem_dmabuf_free_chunk_owner, NULL); > > > > > > > > > > @@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov) > > > > > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); > > > > > } > > > > > > > > > > +static void > > > > > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding) > > > > > +{ > > > > > + int i; > > > > > + > > > > > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) { > > > > > + struct net_iov *niov; > > > > > + netmem_ref netmem; > > > > > + > > > > > + niov = binding->vec[i]; > > > > > + netmem = net_iov_to_netmem(niov); > > > > > + > > > > > + /* Multiple urefs map to only a single netmem ref. */ > > > > > + if (atomic_xchg(&niov->uref, 0) > 0) > > > > > + WARN_ON_ONCE(!napi_pp_put_page(netmem)); > > > > > + } > > > > > +} > > > > > + > > > > > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > > > { > > > > > struct netdev_rx_queue *rxq; > > > > > @@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > > > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); > > > > > } > > > > > > > > > > + net_devmem_dmabuf_binding_put_urefs(binding); > > > > > > > > Sigh, I think what you're trying to do here is very complicated. You > > > > need to think about this scenario: > > > > > > > > 1. user binds dmabuf and opens a autorelease=off socket. > > > > 2. Data arrives on these sockets, and sits in the receive queues, > > > > recvmsg has not been called yet by the user. > > > > 3. User unbinds the dma-buff, netmems are still in the receive queues. > > > > 4. User calls recvmsg on one of these sockets, which obtains a uref on > > > > the netmems in the receive queues. > > > > 5. user closes the socket. 
> > > > > > > > With autorelease=on, this works, because the binding remains alive > > > > until step 5 (even though it's unbound from the queue, > > > > ..._binding_free has not been called yet) and step 5 cleans up all > > > > references, even if the binding is unbound but alive, and > > > > > > > > calling net_devmem_dmabuf_binding_put_urefs here is weird. > > > > Autorelease=off implies the user must clean their urefs themselves, > > > > but we have this here in the unbind path, and it doesn't even > > > > guarantee that the urefs are free at this point because it may race > > > > with a recvmsg. > > > > > > > > Should we delete this uref cleanup here, and enforce that > > > > autorelease=off means that the user cleans up the references (the > > > > kernel never cleans them up on unbind or socket close)? The dontneed > > > > path needs to work whether the binding is active or unbound. > > > > > > > > > > I agree, I think we can do away with the "unbind drops references" idea. > > > A counter argument could be that it introduces the ability for one > > > process to interfere with another, but in fact that is already possible > > > with autorelease=on by not issuing dontneed and starving the other of > > > tokens. > > > > > > > On second thought I don't think we can remove the references drop > > completely. AFAIU if the userspace misbehaves and doens't dontneed the > > netmems in this setup, then the binding will leak forever, which is > > really not great. > > > > I think what may work is having a refcount on the binding for each > > rxqueue it's bound to and each socket that's using it. Once that > > refcount drops to 0, then we can be sure that the urefs in the binding > > are not in use anymore, and we can drop the urefs, which should make > > the binding refcount to hit 0 and the _binding_free() function to be > > called. > > > > That checks out. I guess a reasonable name to differentiate with > binding->ref might be binding->users? 
> I was turning that exact same question in my head. It would be nice to reuse binding->ref somehow, but without deep thinking I'm not sure that's possible. It seems that binding->ref should guard *_binding_free() being called when the binding is not used by anything (remember that each netmem allocated from the binding has effectively a ref on the page_pool it belongs to, and the page_pool has a ref on the binding, so effectively the binding never goes away until all the netmems are returned to the pool). And seemingly binding->users should guard *_put_urefs() being called when we're sure there are no userspace refs outstanding (no sockets open and no rx queues active). -- Thanks, Mina ^ permalink raw reply [flat|nested] 21+ messages in thread
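The two-counter scheme Mina describes (binding->ref guarding *_binding_free(), binding->users guarding *_put_urefs()) can be sketched as a userspace model. Everything here is illustrative under assumptions: `binding_model`, `binding_put_user`, and `binding_put_ref` are hypothetical names, and plain bools stand in for the real drain and free paths; it is not the patch's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the proposal: users counts the active consumers
 * (bound rx queues + sockets); when it hits zero, any urefs leaked by
 * userspace can safely be drained. ref counts everything keeping the
 * memory alive; free happens only when IT hits zero. */
struct binding_model {
	atomic_int ref;   /* guards _binding_free()            */
	atomic_int users; /* guards _put_urefs()               */
	bool urefs_drained;
	bool freed;
};

static void binding_put_ref(struct binding_model *b)
{
	if (atomic_fetch_sub(&b->ref, 1) == 1)
		b->freed = true; /* __..._binding_free() would run here */
}

static void binding_put_user(struct binding_model *b)
{
	if (atomic_fetch_sub(&b->users, 1) == 1) {
		/* Last rxqueue/socket gone: no one can mint new urefs,
		 * so drain the leaked ones, then drop the one ref they
		 * collectively held on the binding. */
		b->urefs_drained = true; /* ..._put_urefs() would run here */
		binding_put_ref(b);
	}
}
```

The design point: draining urefs happens strictly before the final ref drop, so a misbehaving userspace that never calls SO_DEVMEM_DONTNEED can no longer leak the binding forever.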
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-13 20:42 ` Mina Almasry @ 2026-01-14 3:18 ` Bobby Eshleman 2026-01-14 17:04 ` Bobby Eshleman 1 sibling, 0 replies; 21+ messages in thread From: Bobby Eshleman @ 2026-01-14 3:18 UTC (permalink / raw) To: Mina Almasry Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Tue, Jan 13, 2026 at 12:42:50PM -0800, Mina Almasry wrote: > On Tue, Jan 13, 2026 at 12:32 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > On Tue, Jan 13, 2026 at 11:27:38AM -0800, Mina Almasry wrote: > > > On Mon, Jan 12, 2026 at 8:24 AM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > On Sun, Jan 11, 2026 at 11:12:19AM -0800, Mina Almasry wrote: > > > > > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > > > > > > > Add support for autorelease toggling of tokens using a static branch to > > > > > > control system-wide behavior. This allows applications to choose between > > > > > > two memory management modes: > > > > > > > > > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > > > > > socket closes. > > > > > > > > > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > > > > > > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > > > > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > > > > > binding is disallowed and is rejected by netlink. The system will be > > > > > > "locked" into the mode that the first binding is set to. 
It can only be > > > > > > changed again once there are zero bindings on the system. > > > > > > > > > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > > > > > > > > > Static branching is used to limit the system to one mode or the other. > > > > > > > > > > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > --- > > > > > > Changes in v9: > > > > > > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > > > > > > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > > > > > > stubbed out when NET_DEVMEM=n > > > > > > - only dec rx binding count for rx bindings in free (v8 did not exclude > > > > > > TX bindings) > > > > > > > > > > > > Changes in v8: > > > > > > - Only reset static key when bindings go to zero, defaulting back to > > > > > > disabled (Stan). > > > > > > - Fix bad usage of xarray spinlock for sleepy static branch switching, > > > > > > use mutex instead. > > > > > > - Access pp_ref_count via niov->desc instead of niov directly. > > > > > > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > > > > > > the static key can not be changed while there are outstanding tokens > > > > > > (free is only called when reference count reaches zero). > > > > > > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > > > > > > even after xa_erase(), so static key changes must wait until all > > > > > > RX bindings are finally freed (not just when xarray is empty). A > > > > > > counter is a simple way to track this. > > > > > > - socket takes reference on the binding, to avoid use-after-free on > > > > > > sk_devmem_info.binding in the case that user releases all tokens, > > > > > > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> > > > > > - removed some comments that were unnecessary > > > > > > > > > > > > Changes in v7: > > > > > > - implement autorelease with static branch (Stan) > > > > > > - use netlink instead of sockopt (Stan) > > > > > > - merge uAPI and implementation patches into one patch (seemed less > > > > > > confusing) > > > > > > > > > > > > Changes in v6: > > > > > > - remove sk_devmem_info.autorelease, using binding->autorelease instead > > > > > > - move binding->autorelease check to outside of > > > > > > net_devmem_dmabuf_binding_put_urefs() (Mina) > > > > > > - remove overly defensive net_is_devmem_iov() (Mina) > > > > > > - add comment about multiple urefs mapping to a single netmem ref (Mina) > > > > > > - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina) > > > > > > - use niov without casting back and forth with netmem (Mina) > > > > > > - move the autorelease flag from per-binding to per-socket (Mina) > > > > > > - remove the batching logic in sock_devmem_dontneed_manual_release() > > > > > > (Mina) > > > > > > - move autorelease check inside tcp_xa_pool_commit() (Mina) > > > > > > - remove single-binding restriction for autorelease mode (Mina) > > > > > > - unbind always checks for leaked urefs > > > > > > > > > > > > Changes in v5: > > > > > > - remove unused variables > > > > > > - introduce autorelease flag, preparing for future patch toggle new > > > > > > behavior > > > > > > > > > > > > Changes in v3: > > > > > > - make urefs per-binding instead of per-socket, reducing memory > > > > > > footprint > > > > > > - fallback to cleaning up references in dmabuf unbind if socket leaked > > > > > > tokens > > > > > > - drop ethtool patch > > > > > > > > > > > > Changes in v2: > > > > > > - always use GFP_ZERO for binding->vec (Mina) > > > > > > - remove WARN for changed binding (Mina) > > > > > > - remove extraneous binding ref get (Mina) > > > > > > - remove WARNs on invalid user input (Mina) > > > > > > - pre-assign niovs in binding->vec 
for RX case (Mina) > > > > > > - use atomic_set(, 0) to initialize sk_user_frags.urefs > > > > > > - fix length of alloc for urefs > > > > > > --- > > > > > > Documentation/netlink/specs/netdev.yaml | 12 ++++ > > > > > > include/net/netmem.h | 1 + > > > > > > include/net/sock.h | 7 ++- > > > > > > include/uapi/linux/netdev.h | 1 + > > > > > > net/core/devmem.c | 104 ++++++++++++++++++++++++++++---- > > > > > > net/core/devmem.h | 27 ++++++++- > > > > > > net/core/netdev-genl-gen.c | 5 +- > > > > > > net/core/netdev-genl.c | 10 ++- > > > > > > net/core/sock.c | 57 +++++++++++++++-- > > > > > > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > > > > > > net/ipv4/tcp_ipv4.c | 11 +++- > > > > > > net/ipv4/tcp_minisocks.c | 3 +- > > > > > > tools/include/uapi/linux/netdev.h | 1 + > > > > > > 13 files changed, 269 insertions(+), 46 deletions(-) > > > > > > > > > > > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > > > > > > index 596c306ce52b..7cbe9e7b9ee5 100644 > > > > > > --- a/Documentation/netlink/specs/netdev.yaml > > > > > > +++ b/Documentation/netlink/specs/netdev.yaml > > > > > > @@ -562,6 +562,17 @@ attribute-sets: > > > > > > type: u32 > > > > > > checks: > > > > > > min: 1 > > > > > > + - > > > > > > + name: autorelease > > > > > > + doc: | > > > > > > + Token autorelease mode. If true (1), leaked tokens are automatically > > > > > > + released when the socket closes. If false (0), leaked tokens are only > > > > > > + released when the dmabuf is unbound. Once a binding is created with a > > > > > > + specific mode, all subsequent bindings system-wide must use the same > > > > > > + mode. > > > > > > + > > > > > > + Optional. Defaults to false if not specified. 
> > > > > > + type: u8 > > > > > > > > > > > > operations: > > > > > > list: > > > > > > @@ -769,6 +780,7 @@ operations: > > > > > > - ifindex > > > > > > - fd > > > > > > - queues > > > > > > + - autorelease > > > > > > reply: > > > > > > attributes: > > > > > > - id > > > > > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > > > > > index 9e10f4ac50c3..80d2263ba4ed 100644 > > > > > > --- a/include/net/netmem.h > > > > > > +++ b/include/net/netmem.h > > > > > > @@ -112,6 +112,7 @@ struct net_iov { > > > > > > }; > > > > > > struct net_iov_area *owner; > > > > > > enum net_iov_type type; > > > > > > + atomic_t uref; > > > > > > }; > > > > > > > > > > > > struct net_iov_area { > > > > > > diff --git a/include/net/sock.h b/include/net/sock.h > > > > > > index aafe8bdb2c0f..9d3d5bde15e9 100644 > > > > > > --- a/include/net/sock.h > > > > > > +++ b/include/net/sock.h > > > > > > @@ -352,7 +352,7 @@ struct sk_filter; > > > > > > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > > > > > > * @sk_scm_unused: unused flags for scm_recv() > > > > > > * @ns_tracker: tracker for netns reference > > > > > > - * @sk_user_frags: xarray of pages the user is holding a reference on. > > > > > > + * @sk_devmem_info: the devmem binding information for the socket > > > > > > * @sk_owner: reference to the real owner of the socket that calls > > > > > > * sock_lock_init_class_and_name(). 
> > > > > > */ > > > > > > @@ -584,7 +584,10 @@ struct sock { > > > > > > struct numa_drop_counters *sk_drop_counters; > > > > > > struct rcu_head sk_rcu; > > > > > > netns_tracker ns_tracker; > > > > > > - struct xarray sk_user_frags; > > > > > > + struct { > > > > > > + struct xarray frags; > > > > > > + struct net_devmem_dmabuf_binding *binding; > > > > > > + } sk_devmem_info; > > > > > > > > > > > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > > > > > > struct module *sk_owner; > > > > > > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > > > > > > index e0b579a1df4f..1e5c209cb998 100644 > > > > > > --- a/include/uapi/linux/netdev.h > > > > > > +++ b/include/uapi/linux/netdev.h > > > > > > @@ -207,6 +207,7 @@ enum { > > > > > > NETDEV_A_DMABUF_QUEUES, > > > > > > NETDEV_A_DMABUF_FD, > > > > > > NETDEV_A_DMABUF_ID, > > > > > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > > > > > > > > > __NETDEV_A_DMABUF_MAX, > > > > > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > > > > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > > > > > index 05a9a9e7abb9..05c16df657c7 100644 > > > > > > --- a/net/core/devmem.c > > > > > > +++ b/net/core/devmem.c > > > > > > @@ -11,6 +11,7 @@ > > > > > > #include <linux/genalloc.h> > > > > > > #include <linux/mm.h> > > > > > > #include <linux/netdevice.h> > > > > > > +#include <linux/skbuff_ref.h> > > > > > > #include <linux/types.h> > > > > > > #include <net/netdev_queues.h> > > > > > > #include <net/netdev_rx_queue.h> > > > > > > @@ -28,6 +29,19 @@ > > > > > > > > > > > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > > > > > > > > > > > +/* If the user unbinds before releasing all tokens, the static key must not > > > > > > + * change until all tokens have been released (to avoid calling the wrong > > > > > > + * SO_DEVMEM_DONTNEED handler). 
We prevent this by making static key changes > > > > > > + * and binding alloc/free atomic with regards to each other, using the > > > > > > + * devmem_ar_lock. This works because binding free does not occur until all of > > > > > > + * the outstanding token's references on the binding are dropped. > > > > > > + */ > > > > > > +static DEFINE_MUTEX(devmem_ar_lock); > > > > > > + > > > > > > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key); > > > > > > +EXPORT_SYMBOL(tcp_devmem_ar_key); > > > > > > +static int net_devmem_dmabuf_rx_bindings_count; > > > > > > + > > > > > > static const struct memory_provider_ops dmabuf_devmem_ops; > > > > > > > > > > > > bool net_is_devmem_iov(struct net_iov *niov) > > > > > > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) > > > > > > > > > > > > size_t size, avail; > > > > > > > > > > > > + if (binding->direction == DMA_FROM_DEVICE) { > > > > > > + mutex_lock(&devmem_ar_lock); > > > > > > + net_devmem_dmabuf_rx_bindings_count--; > > > > > > + if (net_devmem_dmabuf_rx_bindings_count == 0) > > > > > > + static_branch_disable(&tcp_devmem_ar_key); > > > > > > + mutex_unlock(&devmem_ar_lock); > > > > > > + } > > > > > > + > > > > > > > > > > I find this loging with devmem_ar_lock and > > > > > net_devmem_dmabuf_rx_bindigs_count a bit complicated. I wonder if we > > > > > can do another simplification here? Can we have it such that the first > > > > > binding sets the system in autorelease on or autorelease off mode, and > > > > > all future bindings maintain this state? We already don't support > > > > > autorelease on/off mix. > > > > > > > > I think that would greatly simplify things. We would still need a lock > > > > to make the static branch change and first release mode setting atomic WRT > > > > each other, but the other parts (like the one above) can be > > > > removed. 
> > > > > > > > > > > > > > > > > > > > gen_pool_for_each_chunk(binding->chunk_pool, > > > > > > net_devmem_dmabuf_free_chunk_owner, NULL); > > > > > > > > > > > > @@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov) > > > > > > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); > > > > > > } > > > > > > > > > > > > +static void > > > > > > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding) > > > > > > +{ > > > > > > + int i; > > > > > > + > > > > > > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) { > > > > > > + struct net_iov *niov; > > > > > > + netmem_ref netmem; > > > > > > + > > > > > > + niov = binding->vec[i]; > > > > > > + netmem = net_iov_to_netmem(niov); > > > > > > + > > > > > > + /* Multiple urefs map to only a single netmem ref. */ > > > > > > + if (atomic_xchg(&niov->uref, 0) > 0) > > > > > > + WARN_ON_ONCE(!napi_pp_put_page(netmem)); > > > > > > + } > > > > > > +} > > > > > > + > > > > > > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > > > > { > > > > > > struct netdev_rx_queue *rxq; > > > > > > @@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > > > > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); > > > > > > } > > > > > > > > > > > > + net_devmem_dmabuf_binding_put_urefs(binding); > > > > > > > > > > Sigh, I think what you're trying to do here is very complicated. You > > > > > need to think about this scenario: > > > > > > > > > > 1. user binds dmabuf and opens a autorelease=off socket. > > > > > 2. Data arrives on these sockets, and sits in the receive queues, > > > > > recvmsg has not been called yet by the user. > > > > > 3. User unbinds the dma-buff, netmems are still in the receive queues. > > > > > 4. User calls recvmsg on one of these sockets, which obtains a uref on > > > > > the netmems in the receive queues. > > > > > 5. user closes the socket. 
> > > > > > > > > > With autorelease=on, this works, because the binding remains alive > > > > > until step 5 (even though it's unbound from the queue, > > > > > ..._binding_free has not been called yet) and step 5 cleans up all > > > > > references, even if the binding is unbound but alive, and > > > > > > > > > > calling net_devmem_dmabuf_binding_put_urefs here is weird. > > > > > Autorelease=off implies the user must clean their urefs themselves, > > > > > but we have this here in the unbind path, and it doesn't even > > > > > guarantee that the urefs are free at this point because it may race > > > > > with a recvmsg. > > > > > > > > > > Should we delete this uref cleanup here, and enforce that > > > > > autorelease=off means that the user cleans up the references (the > > > > > kernel never cleans them up on unbind or socket close)? The dontneed > > > > > path needs to work whether the binding is active or unbound. > > > > > > > > > > > > > I agree, I think we can do away with the "unbind drops references" idea. > > > > A counter argument could be that it introduces the ability for one > > > > process to interfere with another, but in fact that is already possible > > > > with autorelease=on by not issuing dontneed and starving the other of > > > > tokens. > > > > > > > > > > On second thought I don't think we can remove the references drop > > > completely. AFAIU if the userspace misbehaves and doens't dontneed the > > > netmems in this setup, then the binding will leak forever, which is > > > really not great. > > > > > > I think what may work is having a refcount on the binding for each > > > rxqueue it's bound to and each socket that's using it. Once that > > > refcount drops to 0, then we can be sure that the urefs in the binding > > > are not in use anymore, and we can drop the urefs, which should make > > > the binding refcount to hit 0 and the _binding_free() function to be > > > called. > > > > > > > That checks out. 
I guess a reasonable name to differentiate with > > binding->ref might be binding->users? > > > > I was turning that exact same question in my head. It would be nice to > reuse binding->ref somehow, but without deep thinking I'm not sure > that's possible. It seems that binding->ref should guard > *_binding_free() being called when the binding is not used by anything > (remember that each netmem allocated from the binding has effectively > a ref on the page_pool it belong too, and the page_pool has a ref on > the binding, so effectively the binding never goes away until all the > netmems are returned to the pool). > > and seemingly binding->users should guard *_put_urefs() being called > when we're user there is no userspace refs outstanding (no sockets > open and no rx queues active). > I've been mulling over this and I think this is the best way, and definitely solves the cases we discussed earlier. I'll draft up the next rev and see if we can poke holes in it. Best, Bobby ^ permalink raw reply [flat|nested] 21+ messages in thread
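The "first binding locks the mode" simplification agreed on earlier in the thread, with the static key defaulting back to disabled once the RX binding count returns to zero, can be sketched in userspace. This is a sketch under assumptions: a pthread mutex stands in for devmem_ar_lock, a plain bool stands in for the tcp_devmem_ar_key static branch, and `rx_binding_create`/`rx_binding_free` are illustrative names, not the patch's functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <pthread.h>

static pthread_mutex_t ar_lock = PTHREAD_MUTEX_INITIALIZER;
static int rx_bindings_count;
static bool ar_key_enabled; /* stands in for the static branch state */

/* Returns 0 on success, -1 if the requested mode conflicts with the
 * mode the system is already locked into (netlink would reject this). */
static int rx_binding_create(bool autorelease)
{
	int err = 0;

	pthread_mutex_lock(&ar_lock);
	if (rx_bindings_count == 0)
		ar_key_enabled = autorelease; /* first binding sets mode */
	else if (ar_key_enabled != autorelease)
		err = -1;                     /* mixed modes disallowed */
	if (!err)
		rx_bindings_count++;
	pthread_mutex_unlock(&ar_lock);
	return err;
}

/* Called from the (deferred) binding free path: only when the LAST RX
 * binding is finally freed does the mode reset to its default, so the
 * key cannot flip while tokens are still outstanding. */
static void rx_binding_free(void)
{
	pthread_mutex_lock(&ar_lock);
	if (--rx_bindings_count == 0)
		ar_key_enabled = false; /* default back to disabled */
	pthread_mutex_unlock(&ar_lock);
}
```

The mutex makes the mode check-and-set atomic with binding creation and free, which is the property the thread converged on; the counter (rather than xarray emptiness) matters because free is deferred until all token references are gone.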
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-13 20:42 ` Mina Almasry 2026-01-14 3:18 ` Bobby Eshleman @ 2026-01-14 17:04 ` Bobby Eshleman 1 sibling, 0 replies; 21+ messages in thread From: Bobby Eshleman @ 2026-01-14 17:04 UTC (permalink / raw) To: Mina Almasry Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Tue, Jan 13, 2026 at 12:42:50PM -0800, Mina Almasry wrote: > On Tue, Jan 13, 2026 at 12:32 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > On Tue, Jan 13, 2026 at 11:27:38AM -0800, Mina Almasry wrote: > > > On Mon, Jan 12, 2026 at 8:24 AM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > On Sun, Jan 11, 2026 at 11:12:19AM -0800, Mina Almasry wrote: > > > > > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > > > > > > > Add support for autorelease toggling of tokens using a static branch to > > > > > > control system-wide behavior. This allows applications to choose between > > > > > > two memory management modes: > > > > > > > > > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > > > > > socket closes. > > > > > > > > > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > > > > > > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > > > > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > > > > > binding is disallowed and is rejected by netlink. The system will be > > > > > > "locked" into the mode that the first binding is set to. 
It can only be > > > > > > changed again once there are zero bindings on the system. > > > > > > > > > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > > > > > > > > > Static branching is used to limit the system to one mode or the other. > > > > > > > > > > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > --- > > > > > > Changes in v9: > > > > > > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > > > > > > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > > > > > > stubbed out when NET_DEVMEM=n > > > > > > - only dec rx binding count for rx bindings in free (v8 did not exclude > > > > > > TX bindings) > > > > > > > > > > > > Changes in v8: > > > > > > - Only reset static key when bindings go to zero, defaulting back to > > > > > > disabled (Stan). > > > > > > - Fix bad usage of xarray spinlock for sleepy static branch switching, > > > > > > use mutex instead. > > > > > > - Access pp_ref_count via niov->desc instead of niov directly. > > > > > > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > > > > > > the static key can not be changed while there are outstanding tokens > > > > > > (free is only called when reference count reaches zero). > > > > > > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > > > > > > even after xa_erase(), so static key changes must wait until all > > > > > > RX bindings are finally freed (not just when xarray is empty). A > > > > > > counter is a simple way to track this. > > > > > > - socket takes reference on the binding, to avoid use-after-free on > > > > > > sk_devmem_info.binding in the case that user releases all tokens, > > > > > > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> > > > > > - removed some comments that were unnecessary > > > > > > > > > > > > Changes in v7: > > > > > > - implement autorelease with static branch (Stan) > > > > > > - use netlink instead of sockopt (Stan) > > > > > > - merge uAPI and implementation patches into one patch (seemed less > > > > > > confusing) > > > > > > > > > > > > Changes in v6: > > > > > > - remove sk_devmem_info.autorelease, using binding->autorelease instead > > > > > > - move binding->autorelease check to outside of > > > > > > net_devmem_dmabuf_binding_put_urefs() (Mina) > > > > > > - remove overly defensive net_is_devmem_iov() (Mina) > > > > > > - add comment about multiple urefs mapping to a single netmem ref (Mina) > > > > > > - remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina) > > > > > > - use niov without casting back and forth with netmem (Mina) > > > > > > - move the autorelease flag from per-binding to per-socket (Mina) > > > > > > - remove the batching logic in sock_devmem_dontneed_manual_release() > > > > > > (Mina) > > > > > > - move autorelease check inside tcp_xa_pool_commit() (Mina) > > > > > > - remove single-binding restriction for autorelease mode (Mina) > > > > > > - unbind always checks for leaked urefs > > > > > > > > > > > > Changes in v5: > > > > > > - remove unused variables > > > > > > - introduce autorelease flag, preparing for future patch to toggle new > > > > > > behavior > > > > > > > > > > > > Changes in v3: > > > > > > - make urefs per-binding instead of per-socket, reducing memory > > > > > > footprint > > > > > > - fallback to cleaning up references in dmabuf unbind if socket leaked > > > > > > tokens > > > > > > - drop ethtool patch > > > > > > > > > > > > Changes in v2: > > > > > > - always use GFP_ZERO for binding->vec (Mina) > > > > > > - remove WARN for changed binding (Mina) > > > > > > - remove extraneous binding ref get (Mina) > > > > > > - remove WARNs on invalid user input (Mina) > > > > > > - pre-assign niovs in binding->vec 
for RX case (Mina) > > > > > > - use atomic_set(, 0) to initialize sk_user_frags.urefs > > > > > > - fix length of alloc for urefs > > > > > > --- > > > > > > Documentation/netlink/specs/netdev.yaml | 12 ++++ > > > > > > include/net/netmem.h | 1 + > > > > > > include/net/sock.h | 7 ++- > > > > > > include/uapi/linux/netdev.h | 1 + > > > > > > net/core/devmem.c | 104 ++++++++++++++++++++++++++++---- > > > > > > net/core/devmem.h | 27 ++++++++- > > > > > > net/core/netdev-genl-gen.c | 5 +- > > > > > > net/core/netdev-genl.c | 10 ++- > > > > > > net/core/sock.c | 57 +++++++++++++++-- > > > > > > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > > > > > > net/ipv4/tcp_ipv4.c | 11 +++- > > > > > > net/ipv4/tcp_minisocks.c | 3 +- > > > > > > tools/include/uapi/linux/netdev.h | 1 + > > > > > > 13 files changed, 269 insertions(+), 46 deletions(-) > > > > > > > > > > > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > > > > > > index 596c306ce52b..7cbe9e7b9ee5 100644 > > > > > > --- a/Documentation/netlink/specs/netdev.yaml > > > > > > +++ b/Documentation/netlink/specs/netdev.yaml > > > > > > @@ -562,6 +562,17 @@ attribute-sets: > > > > > > type: u32 > > > > > > checks: > > > > > > min: 1 > > > > > > + - > > > > > > + name: autorelease > > > > > > + doc: | > > > > > > + Token autorelease mode. If true (1), leaked tokens are automatically > > > > > > + released when the socket closes. If false (0), leaked tokens are only > > > > > > + released when the dmabuf is unbound. Once a binding is created with a > > > > > > + specific mode, all subsequent bindings system-wide must use the same > > > > > > + mode. > > > > > > + > > > > > > + Optional. Defaults to false if not specified. 
> > > > > > + type: u8 > > > > > > > > > > > > operations: > > > > > > list: > > > > > > @@ -769,6 +780,7 @@ operations: > > > > > > - ifindex > > > > > > - fd > > > > > > - queues > > > > > > + - autorelease > > > > > > reply: > > > > > > attributes: > > > > > > - id > > > > > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > > > > > index 9e10f4ac50c3..80d2263ba4ed 100644 > > > > > > --- a/include/net/netmem.h > > > > > > +++ b/include/net/netmem.h > > > > > > @@ -112,6 +112,7 @@ struct net_iov { > > > > > > }; > > > > > > struct net_iov_area *owner; > > > > > > enum net_iov_type type; > > > > > > + atomic_t uref; > > > > > > }; > > > > > > > > > > > > struct net_iov_area { > > > > > > diff --git a/include/net/sock.h b/include/net/sock.h > > > > > > index aafe8bdb2c0f..9d3d5bde15e9 100644 > > > > > > --- a/include/net/sock.h > > > > > > +++ b/include/net/sock.h > > > > > > @@ -352,7 +352,7 @@ struct sk_filter; > > > > > > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > > > > > > * @sk_scm_unused: unused flags for scm_recv() > > > > > > * @ns_tracker: tracker for netns reference > > > > > > - * @sk_user_frags: xarray of pages the user is holding a reference on. > > > > > > + * @sk_devmem_info: the devmem binding information for the socket > > > > > > * @sk_owner: reference to the real owner of the socket that calls > > > > > > * sock_lock_init_class_and_name(). 
> > > > > > */ > > > > > > @@ -584,7 +584,10 @@ struct sock { > > > > > > struct numa_drop_counters *sk_drop_counters; > > > > > > struct rcu_head sk_rcu; > > > > > > netns_tracker ns_tracker; > > > > > > - struct xarray sk_user_frags; > > > > > > + struct { > > > > > > + struct xarray frags; > > > > > > + struct net_devmem_dmabuf_binding *binding; > > > > > > + } sk_devmem_info; > > > > > > > > > > > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > > > > > > struct module *sk_owner; > > > > > > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > > > > > > index e0b579a1df4f..1e5c209cb998 100644 > > > > > > --- a/include/uapi/linux/netdev.h > > > > > > +++ b/include/uapi/linux/netdev.h > > > > > > @@ -207,6 +207,7 @@ enum { > > > > > > NETDEV_A_DMABUF_QUEUES, > > > > > > NETDEV_A_DMABUF_FD, > > > > > > NETDEV_A_DMABUF_ID, > > > > > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > > > > > > > > > __NETDEV_A_DMABUF_MAX, > > > > > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > > > > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > > > > > index 05a9a9e7abb9..05c16df657c7 100644 > > > > > > --- a/net/core/devmem.c > > > > > > +++ b/net/core/devmem.c > > > > > > @@ -11,6 +11,7 @@ > > > > > > #include <linux/genalloc.h> > > > > > > #include <linux/mm.h> > > > > > > #include <linux/netdevice.h> > > > > > > +#include <linux/skbuff_ref.h> > > > > > > #include <linux/types.h> > > > > > > #include <net/netdev_queues.h> > > > > > > #include <net/netdev_rx_queue.h> > > > > > > @@ -28,6 +29,19 @@ > > > > > > > > > > > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > > > > > > > > > > > +/* If the user unbinds before releasing all tokens, the static key must not > > > > > > + * change until all tokens have been released (to avoid calling the wrong > > > > > > + * SO_DEVMEM_DONTNEED handler). 
We prevent this by making static key changes > > > > > > + * and binding alloc/free atomic with regards to each other, using the > > > > > > + * devmem_ar_lock. This works because binding free does not occur until all of > > > > > > + * the outstanding token's references on the binding are dropped. > > > > > > + */ > > > > > > +static DEFINE_MUTEX(devmem_ar_lock); > > > > > > + > > > > > > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key); > > > > > > +EXPORT_SYMBOL(tcp_devmem_ar_key); > > > > > > +static int net_devmem_dmabuf_rx_bindings_count; > > > > > > + > > > > > > static const struct memory_provider_ops dmabuf_devmem_ops; > > > > > > > > > > > > bool net_is_devmem_iov(struct net_iov *niov) > > > > > > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) > > > > > > > > > > > > size_t size, avail; > > > > > > > > > > > > + if (binding->direction == DMA_FROM_DEVICE) { > > > > > > + mutex_lock(&devmem_ar_lock); > > > > > > + net_devmem_dmabuf_rx_bindings_count--; > > > > > > + if (net_devmem_dmabuf_rx_bindings_count == 0) > > > > > > + static_branch_disable(&tcp_devmem_ar_key); > > > > > > + mutex_unlock(&devmem_ar_lock); > > > > > > + } > > > > > > + > > > > > > > > > > I find this logic with devmem_ar_lock and > > > > > net_devmem_dmabuf_rx_bindings_count a bit complicated. I wonder if we > > > > > can do another simplification here? Can we have it such that the first > > > > > binding sets the system in autorelease on or autorelease off mode, and > > > > > all future bindings maintain this state? We already don't support > > > > > autorelease on/off mix. > > > > > > > > I think that would greatly simplify things. We would still need a lock > > > > to make the static branch change and first release mode setting atomic WRT > > > > each other, but the other parts (like the one above) can be > > > > removed. 
> > > > > > > > > > > > > > > > > > > > > > gen_pool_for_each_chunk(binding->chunk_pool, > > > > > > net_devmem_dmabuf_free_chunk_owner, NULL); > > > > > > > > > > > > @@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov) > > > > > > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); > > > > > > } > > > > > > > > > > > > +static void > > > > > > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding) > > > > > > +{ > > > > > > + int i; > > > > > > + > > > > > > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) { > > > > > > + struct net_iov *niov; > > > > > > + netmem_ref netmem; > > > > > > + > > > > > > + niov = binding->vec[i]; > > > > > > + netmem = net_iov_to_netmem(niov); > > > > > > + > > > > > > + /* Multiple urefs map to only a single netmem ref. */ > > > > > > + if (atomic_xchg(&niov->uref, 0) > 0) > > > > > > + WARN_ON_ONCE(!napi_pp_put_page(netmem)); > > > > > > + } > > > > > > +} > > > > > > + > > > > > > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > > > > { > > > > > > struct netdev_rx_queue *rxq; > > > > > > @@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > > > > > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); > > > > > > } > > > > > > > > > > > > + net_devmem_dmabuf_binding_put_urefs(binding); > > > > > > > > > > Sigh, I think what you're trying to do here is very complicated. You > > > > > need to think about this scenario: > > > > > > > > > > 1. User binds dmabuf and opens an autorelease=off socket. > > > > > 2. Data arrives on these sockets, and sits in the receive queues, > > > > > recvmsg has not been called yet by the user. > > > > > 3. User unbinds the dma-buf, netmems are still in the receive queues. > > > > > 4. User calls recvmsg on one of these sockets, which obtains a uref on > > > > > the netmems in the receive queues. > > > > > 5. User closes the socket. 
> > > > > > > > > > With autorelease=on, this works, because the binding remains alive > > > > > until step 5 (even though it's unbound from the queue, > > > > > ..._binding_free has not been called yet) and step 5 cleans up all > > > > > references, even if the binding is unbound but alive, and > > > > > > > > > > calling net_devmem_dmabuf_binding_put_urefs here is weird. > > > > > Autorelease=off implies the user must clean their urefs themselves, > > > > > but we have this here in the unbind path, and it doesn't even > > > > > guarantee that the urefs are free at this point because it may race > > > > > with a recvmsg. > > > > > > > > > > Should we delete this uref cleanup here, and enforce that > > > > > autorelease=off means that the user cleans up the references (the > > > > > kernel never cleans them up on unbind or socket close)? The dontneed > > > > > path needs to work whether the binding is active or unbound. > > > > > > > > > > > > > I agree, I think we can do away with the "unbind drops references" idea. > > > > A counter argument could be that it introduces the ability for one > > > > process to interfere with another, but in fact that is already possible > > > > with autorelease=on by not issuing dontneed and starving the other of > > > > tokens. > > > > > > > > > > On second thought I don't think we can remove the references drop > > > completely. AFAIU if the userspace misbehaves and doesn't dontneed the > > > netmems in this setup, then the binding will leak forever, which is > > > really not great. > > > > > > I think what may work is having a refcount on the binding for each > > > rxqueue it's bound to and each socket that's using it. Once that > > > refcount drops to 0, then we can be sure that the urefs in the binding > > > are not in use anymore, and we can drop the urefs, which should make > > > the binding refcount hit 0 and cause the _binding_free() function to be > > > called. > > > > > > > That checks out. 
I guess a reasonable name to differentiate from > > binding->ref might be binding->users? > > > > I was turning that exact same question in my head. It would be nice to > reuse binding->ref somehow, but without deep thinking I'm not sure > that's possible. It seems that binding->ref should guard > *_binding_free() being called when the binding is not used by anything > (remember that each netmem allocated from the binding has effectively > a ref on the page_pool it belongs to, and the page_pool has a ref on > the binding, so effectively the binding never goes away until all the > netmems are returned to the pool). > > and seemingly binding->users should guard *_put_urefs() being called > when we're sure there are no userspace refs outstanding (no sockets > open and no rx queues active). > > > -- > Thanks, > Mina I know this is late in the game, but I'm considering a name change: From NETDEV_A_DMABUF_AUTORELEASE to NETDEV_A_DMABUF_COMPAT_DONTNEED. I think this will make it less confusing which one is which. WDYT? Best, Bobby ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-12 16:24 ` Bobby Eshleman 2026-01-13 19:27 ` Mina Almasry @ 2026-01-14 20:54 ` Stanislav Fomichev 2026-01-14 21:25 ` Bobby Eshleman 1 sibling, 1 reply; 21+ messages in thread From: Stanislav Fomichev @ 2026-01-14 20:54 UTC (permalink / raw) To: Bobby Eshleman Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On 01/12, Bobby Eshleman wrote: > On Sun, Jan 11, 2026 at 11:12:19AM -0800, Mina Almasry wrote: > > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > Add support for autorelease toggling of tokens using a static branch to > > > control system-wide behavior. This allows applications to choose between > > > two memory management modes: > > > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > > socket closes. > > > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > > binding is disallowed and is rejected by netlink. The system will be > > > "locked" into the mode that the first binding is set to. It can only be > > > changed again once there are zero bindings on the system. > > > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > > > Static branching is used to limit the system to one mode or the other. 
> > > > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > > --- > > > Changes in v9: > > > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > > > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > > > stubbed out when NET_DEVMEM=n > > > - only dec rx binding count for rx bindings in free (v8 did not exclude > > > TX bindings) > > > > > > Changes in v8: > > > - Only reset static key when bindings go to zero, defaulting back to > > > disabled (Stan). > > > - Fix bad usage of xarray spinlock for sleepy static branch switching, > > > use mutex instead. > > > - Access pp_ref_count via niov->desc instead of niov directly. > > > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > > > the static key can not be changed while there are outstanding tokens > > > (free is only called when reference count reaches zero). > > > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > > > even after xa_erase(), so static key changes must wait until all > > > RX bindings are finally freed (not just when xarray is empty). A > > > counter is a simple way to track this. > > > - socket takes reference on the binding, to avoid use-after-free on > > > sk_devmem_info.binding in the case that user releases all tokens, > > > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> > > - removed some comments that were unnecessary > > > > > > Changes in v7: > > > - implement autorelease with static branch (Stan) > > > - use netlink instead of sockopt (Stan) > > > - merge uAPI and implementation patches into one patch (seemed less > > > confusing) > > > > > > Changes in v6: > > > - remove sk_devmem_info.autorelease, using binding->autorelease instead > > > - move binding->autorelease check to outside of > > > net_devmem_dmabuf_binding_put_urefs() (Mina) > > > - remove overly defensive net_is_devmem_iov() (Mina) > > > - add comment about multiple urefs mapping to a single netmem ref (Mina) > > > - remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina) > > > - use niov without casting back and forth with netmem (Mina) > > > - move the autorelease flag from per-binding to per-socket (Mina) > > > - remove the batching logic in sock_devmem_dontneed_manual_release() > > > (Mina) > > > - move autorelease check inside tcp_xa_pool_commit() (Mina) > > > - remove single-binding restriction for autorelease mode (Mina) > > > - unbind always checks for leaked urefs > > > > > > Changes in v5: > > > - remove unused variables > > > - introduce autorelease flag, preparing for future patch to toggle new > > > behavior > > > > > > Changes in v3: > > > - make urefs per-binding instead of per-socket, reducing memory > > > footprint > > > - fallback to cleaning up references in dmabuf unbind if socket leaked > > > tokens > > > - drop ethtool patch > > > > > > Changes in v2: > > > - always use GFP_ZERO for binding->vec (Mina) > > > - remove WARN for changed binding (Mina) > > > - remove extraneous binding ref get (Mina) > > > - remove WARNs on invalid user input (Mina) > > > - pre-assign niovs in binding->vec for RX case (Mina) > > > - use atomic_set(, 0) to initialize sk_user_frags.urefs > > > - fix length of alloc for urefs > > > --- > > > Documentation/netlink/specs/netdev.yaml | 12 ++++ > > > include/net/netmem.h | 1 + > > > include/net/sock.h | 7 
++- > > > include/uapi/linux/netdev.h | 1 + > > > net/core/devmem.c | 104 ++++++++++++++++++++++++++++---- > > > net/core/devmem.h | 27 ++++++++- > > > net/core/netdev-genl-gen.c | 5 +- > > > net/core/netdev-genl.c | 10 ++- > > > net/core/sock.c | 57 +++++++++++++++-- > > > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > > > net/ipv4/tcp_ipv4.c | 11 +++- > > > net/ipv4/tcp_minisocks.c | 3 +- > > > tools/include/uapi/linux/netdev.h | 1 + > > > 13 files changed, 269 insertions(+), 46 deletions(-) > > > > > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > > > index 596c306ce52b..7cbe9e7b9ee5 100644 > > > --- a/Documentation/netlink/specs/netdev.yaml > > > +++ b/Documentation/netlink/specs/netdev.yaml > > > @@ -562,6 +562,17 @@ attribute-sets: > > > type: u32 > > > checks: > > > min: 1 > > > + - > > > + name: autorelease > > > + doc: | > > > + Token autorelease mode. If true (1), leaked tokens are automatically > > > + released when the socket closes. If false (0), leaked tokens are only > > > + released when the dmabuf is unbound. Once a binding is created with a > > > + specific mode, all subsequent bindings system-wide must use the same > > > + mode. > > > + > > > + Optional. Defaults to false if not specified. 
> > > + type: u8 > > > > > > operations: > > > list: > > > @@ -769,6 +780,7 @@ operations: > > > - ifindex > > > - fd > > > - queues > > > + - autorelease > > > reply: > > > attributes: > > > - id > > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > > index 9e10f4ac50c3..80d2263ba4ed 100644 > > > --- a/include/net/netmem.h > > > +++ b/include/net/netmem.h > > > @@ -112,6 +112,7 @@ struct net_iov { > > > }; > > > struct net_iov_area *owner; > > > enum net_iov_type type; > > > + atomic_t uref; > > > }; > > > > > > struct net_iov_area { > > > diff --git a/include/net/sock.h b/include/net/sock.h > > > index aafe8bdb2c0f..9d3d5bde15e9 100644 > > > --- a/include/net/sock.h > > > +++ b/include/net/sock.h > > > @@ -352,7 +352,7 @@ struct sk_filter; > > > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > > > * @sk_scm_unused: unused flags for scm_recv() > > > * @ns_tracker: tracker for netns reference > > > - * @sk_user_frags: xarray of pages the user is holding a reference on. > > > + * @sk_devmem_info: the devmem binding information for the socket > > > * @sk_owner: reference to the real owner of the socket that calls > > > * sock_lock_init_class_and_name(). 
> > > */ > > > @@ -584,7 +584,10 @@ struct sock { > > > struct numa_drop_counters *sk_drop_counters; > > > struct rcu_head sk_rcu; > > > netns_tracker ns_tracker; > > > - struct xarray sk_user_frags; > > > + struct { > > > + struct xarray frags; > > > + struct net_devmem_dmabuf_binding *binding; > > > + } sk_devmem_info; > > > > > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > > > struct module *sk_owner; > > > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > > > index e0b579a1df4f..1e5c209cb998 100644 > > > --- a/include/uapi/linux/netdev.h > > > +++ b/include/uapi/linux/netdev.h > > > @@ -207,6 +207,7 @@ enum { > > > NETDEV_A_DMABUF_QUEUES, > > > NETDEV_A_DMABUF_FD, > > > NETDEV_A_DMABUF_ID, > > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > > > __NETDEV_A_DMABUF_MAX, > > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > > index 05a9a9e7abb9..05c16df657c7 100644 > > > --- a/net/core/devmem.c > > > +++ b/net/core/devmem.c > > > @@ -11,6 +11,7 @@ > > > #include <linux/genalloc.h> > > > #include <linux/mm.h> > > > #include <linux/netdevice.h> > > > +#include <linux/skbuff_ref.h> > > > #include <linux/types.h> > > > #include <net/netdev_queues.h> > > > #include <net/netdev_rx_queue.h> > > > @@ -28,6 +29,19 @@ > > > > > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > > > > > +/* If the user unbinds before releasing all tokens, the static key must not > > > + * change until all tokens have been released (to avoid calling the wrong > > > + * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes > > > + * and binding alloc/free atomic with regards to each other, using the > > > + * devmem_ar_lock. This works because binding free does not occur until all of > > > + * the outstanding token's references on the binding are dropped. 
> > > + */ > > > +static DEFINE_MUTEX(devmem_ar_lock); > > > + > > > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key); > > > +EXPORT_SYMBOL(tcp_devmem_ar_key); > > > +static int net_devmem_dmabuf_rx_bindings_count; > > > + > > > static const struct memory_provider_ops dmabuf_devmem_ops; > > > > > > bool net_is_devmem_iov(struct net_iov *niov) > > > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) > > > > > > size_t size, avail; > > > > > > + if (binding->direction == DMA_FROM_DEVICE) { > > > + mutex_lock(&devmem_ar_lock); > > > + net_devmem_dmabuf_rx_bindings_count--; > > > + if (net_devmem_dmabuf_rx_bindings_count == 0) > > > + static_branch_disable(&tcp_devmem_ar_key); > > > + mutex_unlock(&devmem_ar_lock); > > > + } > > > + > > > > > > I find this logic with > > devmem_ar_lock and > > net_devmem_dmabuf_rx_bindings_count a bit complicated. I wonder if we > > can do another simplification here? Can we have it such that the first > > binding sets the system in autorelease on or autorelease off mode, and > > all future bindings maintain this state? We already don't support > > autorelease on/off mix. > > I think that would greatly simplify things. We would still need a lock > to make the static branch change and first release mode setting atomic WRT > each other, but the other parts (like the one above) can be > removed. I'm not against this, but I wonder how we can test both modes on NIPA? If we lock the mode, we can only test one mode until the kernel reboots... I wonder whether with your proposed refcnt changes (in the other thread) we can keep the existing mode (where we don't need a reboot). ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token management 2026-01-14 20:54 ` Stanislav Fomichev @ 2026-01-14 21:25 ` Bobby Eshleman 0 siblings, 0 replies; 21+ messages in thread From: Bobby Eshleman @ 2026-01-14 21:25 UTC (permalink / raw) To: Stanislav Fomichev Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Wed, Jan 14, 2026 at 12:54:44PM -0800, Stanislav Fomichev wrote: > On 01/12, Bobby Eshleman wrote: > > On Sun, Jan 11, 2026 at 11:12:19AM -0800, Mina Almasry wrote: > > > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > > > > > Add support for autorelease toggling of tokens using a static branch to > > > > control system-wide behavior. This allows applications to choose between > > > > two memory management modes: > > > > > > > > 1. Autorelease on: Leaked tokens are automatically released when the > > > > socket closes. > > > > > > > > 2. Autorelease off: Leaked tokens are released during dmabuf unbind. > > > > > > > > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE > > > > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per > > > > binding is disallowed and is rejected by netlink. The system will be > > > > "locked" into the mode that the first binding is set to. It can only be > > > > changed again once there are zero bindings on the system. > > > > > > > > Disabling autorelease offers ~13% improvement in CPU utilization. > > > > > > > > Static branching is used to limit the system to one mode or the other. 
> > > > > > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > > > --- > > > > Changes in v9: > > > > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n > > > > - Add wrapper around tcp_devmem_ar_key accesses so that it may be > > > > stubbed out when NET_DEVMEM=n > > > > - only dec rx binding count for rx bindings in free (v8 did not exclude > > > > TX bindings) > > > > > > > > Changes in v8: > > > > - Only reset static key when bindings go to zero, defaulting back to > > > > disabled (Stan). > > > > - Fix bad usage of xarray spinlock for sleepy static branch switching, > > > > use mutex instead. > > > > - Access pp_ref_count via niov->desc instead of niov directly. > > > > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that > > > > the static key can not be changed while there are outstanding tokens > > > > (free is only called when reference count reaches zero). > > > > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active > > > > even after xa_erase(), so static key changes must wait until all > > > > RX bindings are finally freed (not just when xarray is empty). A > > > > counter is a simple way to track this. > > > > - socket takes reference on the binding, to avoid use-after-free on > > > > sk_devmem_info.binding in the case that user releases all tokens, > > > > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token). 
> > > > - removed some comments that were unnecessary > > > > > > > > Changes in v7: > > > > - implement autorelease with static branch (Stan) > > > > - use netlink instead of sockopt (Stan) > > > > - merge uAPI and implementation patches into one patch (seemed less > > > > confusing) > > > > > > > > Changes in v6: > > > > - remove sk_devmem_info.autorelease, using binding->autorelease instead > > > > - move binding->autorelease check to outside of > > > > net_devmem_dmabuf_binding_put_urefs() (Mina) > > > > - remove overly defensive net_is_devmem_iov() (Mina) > > > > - add comment about multiple urefs mapping to a single netmem ref (Mina) > > > > - remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina) > > > > - use niov without casting back and forth with netmem (Mina) > > > > - move the autorelease flag from per-binding to per-socket (Mina) > > > > - remove the batching logic in sock_devmem_dontneed_manual_release() > > > > (Mina) > > > > - move autorelease check inside tcp_xa_pool_commit() (Mina) > > > > - remove single-binding restriction for autorelease mode (Mina) > > > > - unbind always checks for leaked urefs > > > > > > > > Changes in v5: > > > > - remove unused variables > > > > - introduce autorelease flag, preparing for future patch to toggle new > > > > behavior > > > > > > > > Changes in v3: > > > > - make urefs per-binding instead of per-socket, reducing memory > > > > footprint > > > > - fallback to cleaning up references in dmabuf unbind if socket leaked > > > > tokens > > > > - drop ethtool patch > > > > > > > > Changes in v2: > > > > - always use GFP_ZERO for binding->vec (Mina) > > > > - remove WARN for changed binding (Mina) > > > > - remove extraneous binding ref get (Mina) > > > > - remove WARNs on invalid user input (Mina) > > > > - pre-assign niovs in binding->vec for RX case (Mina) > > > > - use atomic_set(, 0) to initialize sk_user_frags.urefs > > > > - fix length of alloc for urefs > > > > --- > > > > 
Documentation/netlink/specs/netdev.yaml | 12 ++++ > > > > include/net/netmem.h | 1 + > > > > include/net/sock.h | 7 ++- > > > > include/uapi/linux/netdev.h | 1 + > > > > net/core/devmem.c | 104 ++++++++++++++++++++++++++++---- > > > > net/core/devmem.h | 27 ++++++++- > > > > net/core/netdev-genl-gen.c | 5 +- > > > > net/core/netdev-genl.c | 10 ++- > > > > net/core/sock.c | 57 +++++++++++++++-- > > > > net/ipv4/tcp.c | 76 ++++++++++++++++++----- > > > > net/ipv4/tcp_ipv4.c | 11 +++- > > > > net/ipv4/tcp_minisocks.c | 3 +- > > > > tools/include/uapi/linux/netdev.h | 1 + > > > > 13 files changed, 269 insertions(+), 46 deletions(-) > > > > > > > > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml > > > > index 596c306ce52b..7cbe9e7b9ee5 100644 > > > > --- a/Documentation/netlink/specs/netdev.yaml > > > > +++ b/Documentation/netlink/specs/netdev.yaml > > > > @@ -562,6 +562,17 @@ attribute-sets: > > > > type: u32 > > > > checks: > > > > min: 1 > > > > + - > > > > + name: autorelease > > > > + doc: | > > > > + Token autorelease mode. If true (1), leaked tokens are automatically > > > > + released when the socket closes. If false (0), leaked tokens are only > > > > + released when the dmabuf is unbound. Once a binding is created with a > > > > + specific mode, all subsequent bindings system-wide must use the same > > > > + mode. > > > > + > > > > + Optional. Defaults to false if not specified. 
> > > > + type: u8 > > > > > > > > operations: > > > > list: > > > > @@ -769,6 +780,7 @@ operations: > > > > - ifindex > > > > - fd > > > > - queues > > > > + - autorelease > > > > reply: > > > > attributes: > > > > - id > > > > diff --git a/include/net/netmem.h b/include/net/netmem.h > > > > index 9e10f4ac50c3..80d2263ba4ed 100644 > > > > --- a/include/net/netmem.h > > > > +++ b/include/net/netmem.h > > > > @@ -112,6 +112,7 @@ struct net_iov { > > > > }; > > > > struct net_iov_area *owner; > > > > enum net_iov_type type; > > > > + atomic_t uref; > > > > }; > > > > > > > > struct net_iov_area { > > > > diff --git a/include/net/sock.h b/include/net/sock.h > > > > index aafe8bdb2c0f..9d3d5bde15e9 100644 > > > > --- a/include/net/sock.h > > > > +++ b/include/net/sock.h > > > > @@ -352,7 +352,7 @@ struct sk_filter; > > > > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS > > > > * @sk_scm_unused: unused flags for scm_recv() > > > > * @ns_tracker: tracker for netns reference > > > > - * @sk_user_frags: xarray of pages the user is holding a reference on. > > > > + * @sk_devmem_info: the devmem binding information for the socket > > > > * @sk_owner: reference to the real owner of the socket that calls > > > > * sock_lock_init_class_and_name(). 
> > > > */ > > > > @@ -584,7 +584,10 @@ struct sock { > > > > struct numa_drop_counters *sk_drop_counters; > > > > struct rcu_head sk_rcu; > > > > netns_tracker ns_tracker; > > > > - struct xarray sk_user_frags; > > > > + struct { > > > > + struct xarray frags; > > > > + struct net_devmem_dmabuf_binding *binding; > > > > + } sk_devmem_info; > > > > > > > > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) > > > > struct module *sk_owner; > > > > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h > > > > index e0b579a1df4f..1e5c209cb998 100644 > > > > --- a/include/uapi/linux/netdev.h > > > > +++ b/include/uapi/linux/netdev.h > > > > @@ -207,6 +207,7 @@ enum { > > > > NETDEV_A_DMABUF_QUEUES, > > > > NETDEV_A_DMABUF_FD, > > > > NETDEV_A_DMABUF_ID, > > > > + NETDEV_A_DMABUF_AUTORELEASE, > > > > > > > > __NETDEV_A_DMABUF_MAX, > > > > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) > > > > diff --git a/net/core/devmem.c b/net/core/devmem.c > > > > index 05a9a9e7abb9..05c16df657c7 100644 > > > > --- a/net/core/devmem.c > > > > +++ b/net/core/devmem.c > > > > @@ -11,6 +11,7 @@ > > > > #include <linux/genalloc.h> > > > > #include <linux/mm.h> > > > > #include <linux/netdevice.h> > > > > +#include <linux/skbuff_ref.h> > > > > #include <linux/types.h> > > > > #include <net/netdev_queues.h> > > > > #include <net/netdev_rx_queue.h> > > > > @@ -28,6 +29,19 @@ > > > > > > > > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); > > > > > > > > +/* If the user unbinds before releasing all tokens, the static key must not > > > > + * change until all tokens have been released (to avoid calling the wrong > > > > + * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes > > > > + * and binding alloc/free atomic with regards to each other, using the > > > > + * devmem_ar_lock. 
This works because binding free does not occur until all of > > > > + * the outstanding token's references on the binding are dropped. > > > > + */ > > > > +static DEFINE_MUTEX(devmem_ar_lock); > > > > + > > > > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key); > > > > +EXPORT_SYMBOL(tcp_devmem_ar_key); > > > > +static int net_devmem_dmabuf_rx_bindings_count; > > > > + > > > > static const struct memory_provider_ops dmabuf_devmem_ops; > > > > > > > > bool net_is_devmem_iov(struct net_iov *niov) > > > > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) > > > > > > > > size_t size, avail; > > > > > > > > + if (binding->direction == DMA_FROM_DEVICE) { > > > > + mutex_lock(&devmem_ar_lock); > > > > + net_devmem_dmabuf_rx_bindings_count--; > > > > + if (net_devmem_dmabuf_rx_bindings_count == 0) > > > > + static_branch_disable(&tcp_devmem_ar_key); > > > > + mutex_unlock(&devmem_ar_lock); > > > > + } > > > > + > > > > > > I find this loging with devmem_ar_lock and > > > net_devmem_dmabuf_rx_bindigs_count a bit complicated. I wonder if we > > > can do another simplification here? Can we have it such that the first > > > binding sets the system in autorelease on or autorelease off mode, and > > > all future bindings maintain this state? We already don't support > > > autorelease on/off mix. > > > > I think that would greatly simplify things. We would still need a lock > > to make the static branch change and first release mode setting atomic WRT > > each other, but the other parts (like the one above) can be > > removed. > > I'm not against this, but I wonder how we can test both modes on NIPA? > If we lock the mode, we can only test one mode until the kernel > reboots... I wonder whether with your proposed refcnt changes (in the > other thread) we can keep existing mode (where we don't need a reboot). Right, with the refcnt changes in the other thread we can keep the existing. 
I couldn't find any solid precedent for switching some per-boot settings in selftests, shy of bpf's spinning up a new VM. Most tests just skip if they need the other mode, which isn't what we want here. Best, Bobby ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [net-next,v9,3/5] net: devmem: implement autorelease token management 2026-01-10 2:18 ` [PATCH net-next v9 3/5] net: devmem: implement autorelease token management Bobby Eshleman 2026-01-11 19:12 ` Mina Almasry @ 2026-01-13 4:00 ` Jakub Kicinski 1 sibling, 0 replies; 21+ messages in thread From: Jakub Kicinski @ 2026-01-13 4:00 UTC (permalink / raw) To: bobbyeshleman Cc: Jakub Kicinski, horms, linux-arch, pabeni, willemb, davem, linux-doc, donald.hunter, matttbe, linux-kernel, asml.silence, arnd, netdev, almasrymina, linux-kselftest, skhawaja, bobbyeshleman, ncardwell, andrew+netdev, kuniyu, edumazet, sdf, corbet, dsahern, shuah This is an AI-generated review of your patch. The human sending this email has considered the AI review valid, or at least plausible. For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html --- net: devmem: implement autorelease token management This commit adds autorelease mode toggling for devmem tokens using a static branch. The system is locked into the mode set by the first binding and can only change when all bindings are released. > diff --git a/net/core/devmem.c b/net/core/devmem.c > index 05a9a9e7abb9..05c16df657c7 100644 > --- a/net/core/devmem.c > +++ b/net/core/devmem.c [ ... ] > @@ -292,25 +333,62 @@ net_devmem_bind_dmabuf(struct net_device *dev, [ ... ] > + mutex_lock(&devmem_ar_lock); > + > + if (direction == DMA_FROM_DEVICE) { > + if (net_devmem_dmabuf_rx_bindings_count > 0) { > + bool mode; > + > + mode = static_key_enabled(&tcp_devmem_ar_key); > + > + /* When bindings exist, enforce that the mode does not > + * change. > + */ > + if (mode != autorelease) { > + NL_SET_ERR_MSG_FMT(extack, > + "System already configured with autorelease=%d", > + mode); > + err = -EINVAL; > + goto err_unlock_mutex; > + } > + } else if (autorelease) { > + /* First binding with autorelease enabled sets the > + * mode. If autorelease is false, the key is already > + * disabled by default so no action is needed. 
> + */ > + static_branch_enable(&tcp_devmem_ar_key); > + } > + > + net_devmem_dmabuf_rx_bindings_count++; > + } > + > + err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id, > + binding, xa_limit_32b, &id_alloc_next, > + GFP_KERNEL); > + if (err < 0) > + goto err_dec_binding_count; > + > + mutex_unlock(&devmem_ar_lock); > + > + list_add(&binding->list, &priv->bindings); > + > + return binding; > + > +err_dec_binding_count: > + if (direction == DMA_FROM_DEVICE) > + net_devmem_dmabuf_rx_bindings_count--; ^^^ If xa_alloc_cyclic() fails after the first binding with autorelease=true, the static branch tcp_devmem_ar_key remains enabled even though net_devmem_dmabuf_rx_bindings_count is decremented back to zero. Should the error path also call static_branch_disable() when the count returns to zero, similar to what __net_devmem_dmabuf_binding_free() does? This appears to leave the system in an inconsistent state where a subsequent binding with autorelease=false would succeed (since count is zero) but the static branch is still enabled from the failed attempt. The SO_DEVMEM_DONTNEED handler would then call the wrong function (autorelease path instead of manual release path). > + > +err_unlock_mutex: > + mutex_unlock(&devmem_ar_lock); [ ... ] -- pw-bot: cr ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH net-next v9 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute 2026-01-10 2:18 [PATCH net-next v9 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman ` (2 preceding siblings ...) 2026-01-10 2:18 ` [PATCH net-next v9 3/5] net: devmem: implement autorelease token management Bobby Eshleman @ 2026-01-10 2:18 ` Bobby Eshleman 2026-01-11 19:14 ` Mina Almasry 2026-01-10 2:18 ` [PATCH net-next v9 5/5] selftests: drv-net: devmem: add autorelease test Bobby Eshleman 4 siblings, 1 reply; 21+ messages in thread From: Bobby Eshleman @ 2026-01-10 2:18 UTC (permalink / raw) To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman From: Bobby Eshleman <bobbyeshleman@meta.com> Update devmem.rst documentation to describe the autorelease netlink attribute used during RX dmabuf binding. The autorelease attribute is specified at bind-time via the netlink API (NETDEV_CMD_BIND_RX) and controls what happens to outstanding tokens when the socket closes. Document the two token release modes (automatic vs manual), how to configure the binding for autorelease, the perf benefits, new caveats and restrictions, and the way the mode is enforced system-wide. 
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> --- Changes in v7: - Document netlink instead of sockopt - Mention system-wide locked to one mode --- Documentation/networking/devmem.rst | 70 +++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst index a6cd7236bfbd..67c63bc5a7ae 100644 --- a/Documentation/networking/devmem.rst +++ b/Documentation/networking/devmem.rst @@ -235,6 +235,76 @@ can be less than the tokens provided by the user in case of: (a) an internal kernel leak bug. (b) the user passed more than 1024 frags. + +Autorelease Control +~~~~~~~~~~~~~~~~~~~ + +The autorelease mode controls what happens to outstanding tokens (tokens not +released via SO_DEVMEM_DONTNEED) when the socket closes. Autorelease is +configured per-binding at binding creation time via the netlink API:: + + struct netdev_bind_rx_req *req; + struct netdev_bind_rx_rsp *rsp; + struct ynl_sock *ys; + struct ynl_error yerr; + + ys = ynl_sock_create(&ynl_netdev_family, &yerr); + + req = netdev_bind_rx_req_alloc(); + netdev_bind_rx_req_set_ifindex(req, ifindex); + netdev_bind_rx_req_set_fd(req, dmabuf_fd); + netdev_bind_rx_req_set_autorelease(req, 0); /* 0 = manual, 1 = auto */ + __netdev_bind_rx_req_set_queues(req, queues, n_queues); + + rsp = netdev_bind_rx(ys, req); + + dmabuf_id = rsp->id; + +When autorelease is disabled (0): + +- Outstanding tokens are NOT released when the socket closes +- Outstanding tokens are only released when the dmabuf is unbound +- Provides better performance by eliminating xarray overhead (~13% CPU reduction) +- Kernel tracks tokens via atomic reference counters in net_iov structures + +When autorelease is enabled (1): + +- Outstanding tokens are automatically released when the socket closes +- Backwards compatible behavior +- Kernel tracks tokens in an xarray per socket + +The default is autorelease disabled. 
+ +Important: In both modes, applications should call SO_DEVMEM_DONTNEED to +return tokens as soon as they are done processing. The autorelease setting only +affects what happens to tokens that are still outstanding when close() is called. + +The mode is enforced system-wide. Once a binding is created with a specific +autorelease mode, all subsequent bindings system-wide must use the same mode. + + +Performance Considerations +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Disabling autorelease provides approximately 13% CPU utilization improvement +in RX workloads. That said, applications must ensure all tokens are released +via SO_DEVMEM_DONTNEED before closing the socket, otherwise the backing pages +will remain pinned until the dmabuf is unbound. + + +Caveats +~~~~~~~ + +- Once a system-wide autorelease mode is selected (via the first binding), + all subsequent bindings must use the same mode. Attempts to create bindings + with a different mode will be rejected with -EINVAL. + +- Applications using manual release mode (autorelease=0) must ensure all tokens + are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource + leaks during the lifetime of the dmabuf binding. Tokens not released before + close() will only be freed when the dmabuf is unbound. + + TX Interface ============ -- 2.47.3 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v9 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute 2026-01-10 2:18 ` [PATCH net-next v9 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute Bobby Eshleman @ 2026-01-11 19:14 ` Mina Almasry 0 siblings, 0 replies; 21+ messages in thread From: Mina Almasry @ 2026-01-11 19:14 UTC (permalink / raw) To: Bobby Eshleman Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > Update devmem.rst documentation to describe the autorelease netlink > attribute used during RX dmabuf binding. > > The autorelease attribute is specified at bind-time via the netlink API > (NETDEV_CMD_BIND_RX) and controls what happens to outstanding tokens > when the socket closes. > > Document the two token release modes (automatic vs manual), how to > configure the binding for autorelease, the perf benefits, new caveats > and restrictions, and the way the mode is enforced system-wide. > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Mina Almasry <almasrymina@google.com> -- Thanks, Mina ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH net-next v9 5/5] selftests: drv-net: devmem: add autorelease test 2026-01-10 2:18 [PATCH net-next v9 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman ` (3 preceding siblings ...) 2026-01-10 2:18 ` [PATCH net-next v9 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute Bobby Eshleman @ 2026-01-10 2:18 ` Bobby Eshleman 2026-01-11 19:16 ` Mina Almasry 4 siblings, 1 reply; 21+ messages in thread From: Bobby Eshleman @ 2026-01-10 2:18 UTC (permalink / raw) To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Mina Almasry, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman From: Bobby Eshleman <bobbyeshleman@meta.com> Add test case for autorelease. The test case is the same as the RX test, but enables autorelease. The original RX test is changed to use the -a 0 flag to disable autorelease. 
TAP version 13 1..4 ok 1 devmem.check_rx ok 2 devmem.check_rx_autorelease ok 3 devmem.check_tx ok 4 devmem.check_tx_chunks Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> --- Changes in v8: - removed stale/missing tests Changes in v7: - use autorelease netlink - remove sockopt tests --- tools/testing/selftests/drivers/net/hw/devmem.py | 21 +++++++++++++++++++-- tools/testing/selftests/drivers/net/hw/ncdevmem.c | 19 +++++++++++++------ 2 files changed, 32 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/drivers/net/hw/devmem.py b/tools/testing/selftests/drivers/net/hw/devmem.py index 45c2d49d55b6..dbe696a445bd 100755 --- a/tools/testing/selftests/drivers/net/hw/devmem.py +++ b/tools/testing/selftests/drivers/net/hw/devmem.py @@ -25,7 +25,24 @@ def check_rx(cfg) -> None: port = rand_port() socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}" - listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} -c {cfg.remote_addr} -v 7" + listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} -c {cfg.remote_addr} -v 7 -a 0" + + with bkg(listen_cmd, exit_wait=True) as ncdevmem: + wait_port_listen(port) + cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \ + head -c 1K | {socat}", host=cfg.remote, shell=True) + + ksft_eq(ncdevmem.ret, 0) + + +@ksft_disruptive +def check_rx_autorelease(cfg) -> None: + require_devmem(cfg) + + port = rand_port() + socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}" + listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} \ + -c {cfg.remote_addr} -v 7 -a 1" with bkg(listen_cmd, exit_wait=True) as ncdevmem: wait_port_listen(port) @@ -68,7 +85,7 @@ def main() -> None: cfg.bin_local = path.abspath(path.dirname(__file__) + "/ncdevmem") cfg.bin_remote = cfg.remote.deploy(cfg.bin_local) - ksft_run([check_rx, check_tx, check_tx_chunks], + ksft_run([check_rx, check_rx_autorelease, check_tx, 
check_tx_chunks], args=(cfg, )) ksft_exit() diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c index 3288ed04ce08..406f1771d9ec 100644 --- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c +++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c @@ -92,6 +92,7 @@ static char *port; static size_t do_validation; static int start_queue = -1; static int num_queues = -1; +static int devmem_autorelease; static char *ifname; static unsigned int ifindex; static unsigned int dmabuf_id; @@ -679,7 +680,8 @@ static int configure_flow_steering(struct sockaddr_in6 *server_sin) static int bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd, struct netdev_queue_id *queues, - unsigned int n_queue_index, struct ynl_sock **ys) + unsigned int n_queue_index, struct ynl_sock **ys, + int autorelease) { struct netdev_bind_rx_req *req = NULL; struct netdev_bind_rx_rsp *rsp = NULL; @@ -695,6 +697,7 @@ static int bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd, req = netdev_bind_rx_req_alloc(); netdev_bind_rx_req_set_ifindex(req, ifindex); netdev_bind_rx_req_set_fd(req, dmabuf_fd); + netdev_bind_rx_req_set_autorelease(req, autorelease); __netdev_bind_rx_req_set_queues(req, queues, n_queue_index); rsp = netdev_bind_rx(*ys, req); @@ -872,7 +875,8 @@ static int do_server(struct memory_buffer *mem) goto err_reset_rss; } - if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys)) { + if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys, + devmem_autorelease)) { pr_err("Failed to bind"); goto err_reset_flow_steering; } @@ -1092,7 +1096,7 @@ int run_devmem_tests(void) goto err_reset_headersplit; } - if (!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys)) { + if (!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys, 0)) { pr_err("Binding empty queues array should have failed"); goto err_unbind; } @@ -1108,7 +1112,7 @@ int run_devmem_tests(void) goto err_reset_headersplit; } - if 
(!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys)) { + if (!bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys, 0)) { pr_err("Configure dmabuf with header split off should have failed"); goto err_unbind; } @@ -1124,7 +1128,7 @@ int run_devmem_tests(void) goto err_reset_headersplit; } - if (bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys)) { + if (bind_rx_queue(ifindex, mem->fd, queues, num_queues, &ys, 0)) { pr_err("Failed to bind"); goto err_reset_headersplit; } @@ -1397,7 +1401,7 @@ int main(int argc, char *argv[]) int is_server = 0, opt; int ret, err = 1; - while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:z:")) != -1) { + while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:z:a:")) != -1) { switch (opt) { case 'l': is_server = 1; @@ -1426,6 +1430,9 @@ int main(int argc, char *argv[]) case 'z': max_chunk = atoi(optarg); break; + case 'a': + devmem_autorelease = atoi(optarg); + break; case '?': fprintf(stderr, "unknown option: %c\n", optopt); break; -- 2.47.3 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v9 5/5] selftests: drv-net: devmem: add autorelease test 2026-01-10 2:18 ` [PATCH net-next v9 5/5] selftests: drv-net: devmem: add autorelease test Bobby Eshleman @ 2026-01-11 19:16 ` Mina Almasry 2026-01-12 19:56 ` Bobby Eshleman 0 siblings, 1 reply; 21+ messages in thread From: Mina Almasry @ 2026-01-11 19:16 UTC (permalink / raw) To: Bobby Eshleman Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > Add test case for autorelease. > > The test case is the same as the RX test, but enables autorelease. The > original RX test is changed to use the -a 0 flag to disable autorelease. > > TAP version 13 > 1..4 > ok 1 devmem.check_rx > ok 2 devmem.check_rx_autorelease > ok 3 devmem.check_tx > ok 4 devmem.check_tx_chunks > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Can you add a test for the problematic/weird scenario I commented on in patch 3? 1. User does bind (autorelease on or off) 2. Data is received. 3. User does unbind. 4. User calls recvmsg() 5. User calls dontneed on the frags obtained in 4. This should work with autorelease=on or off, or at least emit a clean error message (kernel must not splat). I realize I made a suggestion in patch 3 that may make this hard to test (i.e. put the kernel in autorelease on/off mode for the boot session on the first unbind). If we can add a test while making that simplification great, if not, let's not make the simplification I guess. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v9 5/5] selftests: drv-net: devmem: add autorelease test 2026-01-11 19:16 ` Mina Almasry @ 2026-01-12 19:56 ` Bobby Eshleman 0 siblings, 0 replies; 21+ messages in thread From: Bobby Eshleman @ 2026-01-12 19:56 UTC (permalink / raw) To: Mina Almasry Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn, Shuah Khan, Donald Hunter, Stanislav Fomichev, netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest, asml.silence, matttbe, skhawaja, Bobby Eshleman On Sun, Jan 11, 2026 at 11:16:37AM -0800, Mina Almasry wrote: > On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > From: Bobby Eshleman <bobbyeshleman@meta.com> > > > > Add test case for autorelease. > > > > The test case is the same as the RX test, but enables autorelease. The > > original RX test is changed to use the -a 0 flag to disable autorelease. > > > > TAP version 13 > > 1..4 > > ok 1 devmem.check_rx > > ok 2 devmem.check_rx_autorelease > > ok 3 devmem.check_tx > > ok 4 devmem.check_tx_chunks > > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> > > Can you add a test for the problematic/weird scenario I commented on in patch 3? > > 1. User does bind (autorelease on or off) > 2. Data is received. > 3. User does unbind. > 4. User calls recvmsg() > 5. User calls dontneed on the frags obtained in 4. > > This should work with autorelease=on or off, or at least emit a clean > error message (kernel must not splat). IIUC, this looks something like (pseudo-code): ncdevmem.c: do_server(...) { client_fd = accept(...); if (check_premature_unbind) { /* wait for data but don't recvmsg yet */ epoll(client_fd, ...); /* unbind */ ynl_sock_destroy(ys); while (1) { ret = recvmsg(client_fd, &msg, MSG_SOCK_DEVMEM); /* check ret */ ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, ...) /* check ret */ } } else { ... 
} } ... then devmem.py checks dmesg? > > I realize I made a suggestion in patch 3 that may make this hard to test (i.e. put the kernel in autorelease on/off mode for the boot session on the first unbind). If we can add a test while making that simplification great, if not, let's not make the simplification I guess. I think we can do both the simplification and this test, but in general we would have to skip any test when rx bind fails due to the test's new mode not matching. Not sure if that is desired. I tend to like the simplification because I really dislike having to track the RX binding count, but I'm not sure if there is a good way to do that without making our tests locked into a single mode. Maybe a debugfs reset option that rejects when rx_bindings_count is non-zero? That way we can remove all the program logic around rx_bindings_count and make its inc/dec wrapper functions no-ops in production (CONFIG_DEBUG_NET_DEVMEM=n), but still test both modes? The handler would look something like (approx.): #ifdef CONFIG_DEBUG_NET_DEVMEM static ssize_t devmem_reset_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { int ret = count; mutex_lock(&devmem_ar_lock); if (net_devmem_rx_bindings_count_read() != 0) { ret = -EBUSY; goto unlock; } /* enable setting the key again via bind_rx() */ tcp_devmem_ar_locked = false; static_branch_disable(&tcp_devmem_ar_key); unlock: mutex_unlock(&devmem_ar_lock); return ret; } [...] #endif ... but I couldn't find a good precedent for this in the current selftests. Best, Bobby ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2026-01-14 21:25 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-01-10 2:18 [PATCH net-next v9 0/5] net: devmem: improve cpu cost of RX token management Bobby Eshleman 2026-01-10 2:18 ` [PATCH net-next v9 1/5] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman 2026-01-10 2:18 ` [PATCH net-next v9 2/5] net: devmem: refactor sock_devmem_dontneed for autorelease split Bobby Eshleman 2026-01-10 2:18 ` [PATCH net-next v9 3/5] net: devmem: implement autorelease token management Bobby Eshleman 2026-01-11 19:12 ` Mina Almasry 2026-01-11 19:27 ` Mina Almasry 2026-01-12 17:42 ` Bobby Eshleman 2026-01-12 16:24 ` Bobby Eshleman 2026-01-13 19:27 ` Mina Almasry 2026-01-13 20:32 ` Bobby Eshleman 2026-01-13 20:42 ` Mina Almasry 2026-01-14 3:18 ` Bobby Eshleman 2026-01-14 17:04 ` Bobby Eshleman 2026-01-14 20:54 ` Stanislav Fomichev 2026-01-14 21:25 ` Bobby Eshleman 2026-01-13 4:00 ` [net-next,v9,3/5] " Jakub Kicinski 2026-01-10 2:18 ` [PATCH net-next v9 4/5] net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute Bobby Eshleman 2026-01-11 19:14 ` Mina Almasry 2026-01-10 2:18 ` [PATCH net-next v9 5/5] selftests: drv-net: devmem: add autorelease test Bobby Eshleman 2026-01-11 19:16 ` Mina Almasry 2026-01-12 19:56 ` Bobby Eshleman