linux-kselftest.vger.kernel.org archive mirror
* [PATCH net-next v6 0/6] net: devmem: improve cpu cost of RX token management
@ 2025-11-05  1:23 Bobby Eshleman
  2025-11-05  1:23 ` [PATCH net-next v6 1/6] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Bobby Eshleman @ 2025-11-05  1:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
	Shuah Khan, Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
	Stanislav Fomichev, Bobby Eshleman

This series improves the CPU cost of RX token management by adding a
socket option that configures the socket to avoid the xarray allocator,
instead using a niov array and a uref field in the niov.

The improvement is ~13% CPU utilization per RX user thread.

Using kperf, the following results were observed:

Before:
	Average RX worker idle %: 13.13, flows 4, test runs 11
After:
	Average RX worker idle %: 26.32, flows 4, test runs 11

Two other approaches were tested, neither with any improvement: 1)
using a hashmap for tokens, and 2) keeping an xarray of atomic counters
but using RCU so that the hot path could be mostly lockless. Neither
approach proved better than the simple array in terms of CPU cost.

The sockopt SO_DEVMEM_AUTORELEASE is added to toggle the optimization.
It defaults to 0 (autorelease off, i.e., the optimization is on).

Note that prior revisions reported only a 5% gain. That lower figure was
measured with CPU frequency boosting (unknowingly) disabled. A
consistent ~13% is measured for both kperf and NCCL workloads with CPU
frequency boosting on.

To: David S. Miller <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Simon Horman <horms@kernel.org>
To: Kuniyuki Iwashima <kuniyu@google.com>
To: Willem de Bruijn <willemb@google.com>
To: Neal Cardwell <ncardwell@google.com>
To: David Ahern <dsahern@kernel.org>
To: Mina Almasry <almasrymina@google.com>
To: Arnd Bergmann <arnd@arndb.de>
To: Jonathan Corbet <corbet@lwn.net>
To: Andrew Lunn <andrew+netdev@lunn.ch>
To: Shuah Khan <shuah@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>

Changes in v6:
- renamed 'net: devmem: use niov array for token management' to refer to
  optionality of new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding
  field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-upstream-v5-0-47cb85f5259e@meta.com

Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v4-0-39156563c3ea@meta.com

Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v3-0-084b46bda88f@meta.com

Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
  footprint
- fallback to cleaning up references in dmabuf unbind if socket
  leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com

Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule
  (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-upstream-v1-0-d946169b5550@meta.com

---
Bobby Eshleman (6):
      net: devmem: rename tx_vec to vec in dmabuf binding
      net: devmem: refactor sock_devmem_dontneed for autorelease split
      net: devmem: prepare for autorelease rx token management
      net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control
      net: devmem: document SO_DEVMEM_AUTORELEASE socket option
      net: devmem: add tests for SO_DEVMEM_AUTORELEASE socket option

 Documentation/networking/devmem.rst               |  70 +++++++++-
 include/net/netmem.h                              |   1 +
 include/net/sock.h                                |  13 +-
 include/uapi/asm-generic/socket.h                 |   2 +
 net/core/devmem.c                                 |  54 +++++---
 net/core/devmem.h                                 |   4 +-
 net/core/sock.c                                   | 152 ++++++++++++++++++----
 net/ipv4/tcp.c                                    |  69 ++++++++--
 net/ipv4/tcp_ipv4.c                               |  11 +-
 net/ipv4/tcp_minisocks.c                          |   5 +-
 tools/include/uapi/asm-generic/socket.h           |   2 +
 tools/testing/selftests/drivers/net/hw/devmem.py  | 115 +++++++++++++++-
 tools/testing/selftests/drivers/net/hw/ncdevmem.c |  20 ++-
 13 files changed, 453 insertions(+), 65 deletions(-)
---
base-commit: 255d75ef029f33f75fcf5015052b7302486f7ad2
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503

Best regards,
-- 
Bobby Eshleman <bobbyeshleman@meta.com>



* [PATCH net-next v6 1/6] net: devmem: rename tx_vec to vec in dmabuf binding
  2025-11-05  1:23 [PATCH net-next v6 0/6] net: devmem: improve cpu cost of RX token management Bobby Eshleman
@ 2025-11-05  1:23 ` Bobby Eshleman
  2025-11-05  1:23 ` [PATCH net-next v6 2/6] net: devmem: refactor sock_devmem_dontneed for autorelease split Bobby Eshleman
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Bobby Eshleman @ 2025-11-05  1:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
	Shuah Khan, Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
	Stanislav Fomichev, Bobby Eshleman

From: Bobby Eshleman <bobbyeshleman@meta.com>

Rename the 'tx_vec' field in struct net_devmem_dmabuf_binding to 'vec'.
This field holds pointers to net_iov structures. The rename prepares for
reusing 'vec' for both TX and RX directions.

No functional change intended.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 net/core/devmem.c | 22 +++++++++++-----------
 net/core/devmem.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 1d04754bc756..4dee2666dd07 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -75,7 +75,7 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
 	dma_buf_detach(binding->dmabuf, binding->attachment);
 	dma_buf_put(binding->dmabuf);
 	xa_destroy(&binding->bound_rxqs);
-	kvfree(binding->tx_vec);
+	kvfree(binding->vec);
 	kfree(binding);
 }
 
@@ -232,10 +232,10 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 	}
 
 	if (direction == DMA_TO_DEVICE) {
-		binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
-						 sizeof(struct net_iov *),
-						 GFP_KERNEL);
-		if (!binding->tx_vec) {
+		binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+					      sizeof(struct net_iov *),
+					      GFP_KERNEL);
+		if (!binding->vec) {
 			err = -ENOMEM;
 			goto err_unmap;
 		}
@@ -249,7 +249,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 					      dev_to_node(&dev->dev));
 	if (!binding->chunk_pool) {
 		err = -ENOMEM;
-		goto err_tx_vec;
+		goto err_vec;
 	}
 
 	virtual = 0;
@@ -295,7 +295,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 			page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 						      net_devmem_get_dma_addr(niov));
 			if (direction == DMA_TO_DEVICE)
-				binding->tx_vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+				binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
 		}
 
 		virtual += len;
@@ -315,8 +315,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 	gen_pool_for_each_chunk(binding->chunk_pool,
 				net_devmem_dmabuf_free_chunk_owner, NULL);
 	gen_pool_destroy(binding->chunk_pool);
-err_tx_vec:
-	kvfree(binding->tx_vec);
+err_vec:
+	kvfree(binding->vec);
 err_unmap:
 	dma_buf_unmap_attachment_unlocked(binding->attachment, binding->sgt,
 					  direction);
@@ -363,7 +363,7 @@ struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk,
 	int err = 0;
 
 	binding = net_devmem_lookup_dmabuf(dmabuf_id);
-	if (!binding || !binding->tx_vec) {
+	if (!binding || !binding->vec) {
 		err = -EINVAL;
 		goto out_err;
 	}
@@ -414,7 +414,7 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding,
 	*off = virt_addr % PAGE_SIZE;
 	*size = PAGE_SIZE - *off;
 
-	return binding->tx_vec[virt_addr / PAGE_SIZE];
+	return binding->vec[virt_addr / PAGE_SIZE];
 }
 
 /*** "Dmabuf devmem memory provider" ***/
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 101150d761af..2ada54fb63d7 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -63,7 +63,7 @@ struct net_devmem_dmabuf_binding {
 	 * address. This array is convenient to map the virtual addresses to
 	 * net_iovs in the TX path.
 	 */
-	struct net_iov **tx_vec;
+	struct net_iov **vec;
 
 	struct work_struct unbind_w;
 };

-- 
2.47.3



* [PATCH net-next v6 2/6] net: devmem: refactor sock_devmem_dontneed for autorelease split
  2025-11-05  1:23 [PATCH net-next v6 0/6] net: devmem: improve cpu cost of RX token management Bobby Eshleman
  2025-11-05  1:23 ` [PATCH net-next v6 1/6] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
@ 2025-11-05  1:23 ` Bobby Eshleman
  2025-11-05  1:23 ` [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management Bobby Eshleman
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Bobby Eshleman @ 2025-11-05  1:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
	Shuah Khan, Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
	Stanislav Fomichev, Bobby Eshleman

From: Bobby Eshleman <bobbyeshleman@meta.com>

Refactor sock_devmem_dontneed() in preparation for supporting both
autorelease and manual token release modes.

Split the function into two parts:
- sock_devmem_dontneed(): handles input validation, token allocation,
  and copying from userspace
- sock_devmem_dontneed_autorelease(): performs the actual token release
  via xarray lookup and page pool put

This separation allows a future commit to add a parallel
sock_devmem_dontneed_manual_release() function that uses a different
token tracking mechanism (per-niov reference counting) without
duplicating the input validation logic.

The refactoring is purely mechanical with no functional change; it is
intended only to minimize noise in the subsequent patches.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 net/core/sock.c | 52 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 7a9bbc2afcf0..5562f517d889 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1082,30 +1082,13 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 #define MAX_DONTNEED_FRAGS 1024
 
 static noinline_for_stack int
-sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
+sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
+				 unsigned int num_tokens)
 {
-	unsigned int num_tokens, i, j, k, netmem_num = 0;
-	struct dmabuf_token *tokens;
+	unsigned int i, j, k, netmem_num = 0;
 	int ret = 0, num_frags = 0;
 	netmem_ref netmems[16];
 
-	if (!sk_is_tcp(sk))
-		return -EBADF;
-
-	if (optlen % sizeof(*tokens) ||
-	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
-		return -EINVAL;
-
-	num_tokens = optlen / sizeof(*tokens);
-	tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
-	if (!tokens)
-		return -ENOMEM;
-
-	if (copy_from_sockptr(tokens, optval, optlen)) {
-		kvfree(tokens);
-		return -EFAULT;
-	}
-
 	xa_lock_bh(&sk->sk_user_frags);
 	for (i = 0; i < num_tokens; i++) {
 		for (j = 0; j < tokens[i].token_count; j++) {
@@ -1135,6 +1118,35 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
 	for (k = 0; k < netmem_num; k++)
 		WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
 
+	return ret;
+}
+
+static noinline_for_stack int
+sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
+{
+	struct dmabuf_token *tokens;
+	unsigned int num_tokens;
+	int ret;
+
+	if (!sk_is_tcp(sk))
+		return -EBADF;
+
+	if (optlen % sizeof(*tokens) ||
+	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
+		return -EINVAL;
+
+	num_tokens = optlen / sizeof(*tokens);
+	tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
+	if (!tokens)
+		return -ENOMEM;
+
+	if (copy_from_sockptr(tokens, optval, optlen)) {
+		kvfree(tokens);
+		return -EFAULT;
+	}
+
+	ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
+
 	kvfree(tokens);
 	return ret;
 }

-- 
2.47.3



* [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
  2025-11-05  1:23 [PATCH net-next v6 0/6] net: devmem: improve cpu cost of RX token management Bobby Eshleman
  2025-11-05  1:23 ` [PATCH net-next v6 1/6] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
  2025-11-05  1:23 ` [PATCH net-next v6 2/6] net: devmem: refactor sock_devmem_dontneed for autorelease split Bobby Eshleman
@ 2025-11-05  1:23 ` Bobby Eshleman
  2025-11-05 16:02   ` kernel test robot
                     ` (3 more replies)
  2025-11-05  1:23 ` [PATCH net-next v6 4/6] net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control Bobby Eshleman
                   ` (2 subsequent siblings)
  5 siblings, 4 replies; 18+ messages in thread
From: Bobby Eshleman @ 2025-11-05  1:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
	Shuah Khan, Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
	Stanislav Fomichev, Bobby Eshleman

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add an alternative token management implementation (autorelease vs.
non-autorelease) that replaces xarray-based token lookups with direct
array access, using page offsets as dmabuf tokens. When enabled, this
eliminates the xarray overhead and reduces CPU utilization in devmem RX
threads by approximately 13%.

This patch changes the meaning of tokens when this option is used.
Tokens previously referred to unique fragments of pages. With this
option, tokens instead represent references to whole pages, not
fragments. Because of this, multiple tokens may refer to the same page
and so have identical values (e.g., two small fragments may coexist on
the same page). The token and offset pair that the user receives
uniquely identifies a fragment if needed. This assumes that the user is
not attempting to sort / uniq the token list using tokens alone.

This introduces a restriction: devmem RX sockets cannot switch dmabuf
bindings when autorelease is disabled. This is necessary because 32-bit
tokens lack sufficient bits to encode both large dmabuf page counts and
binding/queue IDs. For example, a system with 8 NICs and 32 queues needs
8 bits for binding IDs, leaving only 24 bits for pages (64GB max). This
restriction aligns with common usage, as steering flows to different
queues/devices is often undesirable for TCP.

This patch adds an atomic uref counter to net_iov for tracking user
references via binding->vec. The pp_ref_count is only updated on uref
transitions from zero to one or from one to zero, to minimize atomic
overhead. If a user fails to refill and closes before returning all
tokens, the binding will finish the uref release when unbound.

A per-socket "autorelease" flag is added. It enables the old behavior,
in which the kernel releases the socket's references upon close(2)
(autorelease), instead of requiring that socket users release them
themselves. The flag is always true in this patch, so the old
(non-optimized) behavior is kept unconditionally. A future patch adds a
user-facing knob to toggle the feature and changes the default to false
for the improved performance.

An outstanding_urefs counter is added per-socket so that changes to the
autorelease mode can be rejected for active sockets.

The dmabuf unbind path always checks for any leaked urefs.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v6:
- remove sk_devmem_info.autorelease, using binding->autorelease instead
- move binding->autorelease check to outside of
  net_devmem_dmabuf_binding_put_urefs() (Mina)
- remove overly defensive net_is_devmem_iov() (Mina)
- add comment about multiple urefs mapping to a single netmem ref (Mina)
- remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina)
- use niov without casting back and forth with netmem (Mina)
- move the autorelease flag from per-binding to per-socket (Mina)
- remove the batching logic in sock_devmem_dontneed_manual_release()
  (Mina)
- move autorelease check inside tcp_xa_pool_commit() (Mina)
- remove single-binding restriction for autorelease mode (Mina)
- unbind always checks for leaked urefs

Changes in v5:
- remove unused variables
- introduce autorelease flag, preparing for future patch toggle new
  behavior

Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
  footprint
- fallback to cleaning up references in dmabuf unbind if socket leaked
  tokens
- drop ethtool patch

Changes in v2:
- always use GFP_ZERO for binding->vec (Mina)
- remove WARN for changed binding (Mina)
- remove extraneous binding ref get (Mina)
- remove WARNs on invalid user input (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- fix length of alloc for urefs
---
 include/net/netmem.h     |  1 +
 include/net/sock.h       | 13 +++++++--
 net/core/devmem.c        | 42 ++++++++++++++++++++++-------
 net/core/devmem.h        |  2 +-
 net/core/sock.c          | 55 +++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp.c           | 69 ++++++++++++++++++++++++++++++++++++++----------
 net/ipv4/tcp_ipv4.c      | 11 +++++---
 net/ipv4/tcp_minisocks.c |  5 +++-
 8 files changed, 161 insertions(+), 37 deletions(-)

diff --git a/include/net/netmem.h b/include/net/netmem.h
index 9e10f4ac50c3..80d2263ba4ed 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -112,6 +112,7 @@ struct net_iov {
 	};
 	struct net_iov_area *owner;
 	enum net_iov_type type;
+	atomic_t uref;
 };
 
 struct net_iov_area {
diff --git a/include/net/sock.h b/include/net/sock.h
index c7e58b8e8a90..548fabacff7c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -350,7 +350,11 @@ struct sk_filter;
   *	@sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
   *	@sk_scm_unused: unused flags for scm_recv()
   *	@ns_tracker: tracker for netns reference
-  *	@sk_user_frags: xarray of pages the user is holding a reference on.
+  *	@sk_devmem_info: the devmem binding information for the socket
+  *		@frags: xarray of tokens for autorelease mode
+  *		@binding: pointer to the dmabuf binding
+  *		@outstanding_urefs: count of outstanding user references
+  *		@autorelease: if true, tokens released on close; if false, user must release
   *	@sk_owner: reference to the real owner of the socket that calls
   *		   sock_lock_init_class_and_name().
   */
@@ -579,7 +583,12 @@ struct sock {
 	struct numa_drop_counters *sk_drop_counters;
 	struct rcu_head		sk_rcu;
 	netns_tracker		ns_tracker;
-	struct xarray		sk_user_frags;
+	struct {
+		struct xarray				frags;
+		struct net_devmem_dmabuf_binding	*binding;
+		atomic_t				outstanding_urefs;
+		bool					autorelease;
+	} sk_devmem_info;
 
 #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
 	struct module		*sk_owner;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 4dee2666dd07..904d19e58f4b 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -11,6 +11,7 @@
 #include <linux/genalloc.h>
 #include <linux/mm.h>
 #include <linux/netdevice.h>
+#include <linux/skbuff_ref.h>
 #include <linux/types.h>
 #include <net/netdev_queues.h>
 #include <net/netdev_rx_queue.h>
@@ -116,6 +117,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
 	gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
 }
 
+static void
+net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
+{
+	int i;
+
+	for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
+		struct net_iov *niov;
+		netmem_ref netmem;
+
+		niov = binding->vec[i];
+		netmem = net_iov_to_netmem(niov);
+
+		/* Multiple urefs map to only a single netmem ref. */
+		if (atomic_xchg(&niov->uref, 0) > 0)
+			WARN_ON_ONCE(!napi_pp_put_page(netmem));
+	}
+}
+
 void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 {
 	struct netdev_rx_queue *rxq;
@@ -143,6 +162,10 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 		__net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
 	}
 
+	/* Clean up any lingering urefs from sockets that had autorelease
+	 * disabled.
+	 */
+	net_devmem_dmabuf_binding_put_urefs(binding);
 	net_devmem_dmabuf_binding_put(binding);
 }
 
@@ -231,14 +254,13 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 		goto err_detach;
 	}
 
-	if (direction == DMA_TO_DEVICE) {
-		binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
-					      sizeof(struct net_iov *),
-					      GFP_KERNEL);
-		if (!binding->vec) {
-			err = -ENOMEM;
-			goto err_unmap;
-		}
+	/* Used by TX and also by RX when socket has autorelease disabled */
+	binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+				      sizeof(struct net_iov *),
+				      GFP_KERNEL | __GFP_ZERO);
+	if (!binding->vec) {
+		err = -ENOMEM;
+		goto err_unmap;
 	}
 
 	/* For simplicity we expect to make PAGE_SIZE allocations, but the
@@ -292,10 +314,10 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 			niov = &owner->area.niovs[i];
 			niov->type = NET_IOV_DMABUF;
 			niov->owner = &owner->area;
+			atomic_set(&niov->uref, 0);
 			page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 						      net_devmem_get_dma_addr(niov));
-			if (direction == DMA_TO_DEVICE)
-				binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+			binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
 		}
 
 		virtual += len;
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 2ada54fb63d7..d4eb28d079bb 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -61,7 +61,7 @@ struct net_devmem_dmabuf_binding {
 
 	/* Array of net_iov pointers for this binding, sorted by virtual
 	 * address. This array is convenient to map the virtual addresses to
-	 * net_iovs in the TX path.
+	 * net_iovs.
 	 */
 	struct net_iov **vec;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 5562f517d889..465645c1d74f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -87,6 +87,7 @@
 
 #include <linux/unaligned.h>
 #include <linux/capability.h>
+#include <linux/dma-buf.h>
 #include <linux/errno.h>
 #include <linux/errqueue.h>
 #include <linux/types.h>
@@ -151,6 +152,7 @@
 #include <uapi/linux/pidfd.h>
 
 #include "dev.h"
+#include "devmem.h"
 
 static DEFINE_MUTEX(proto_list_mutex);
 static LIST_HEAD(proto_list);
@@ -1081,6 +1083,43 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 #define MAX_DONTNEED_TOKENS 128
 #define MAX_DONTNEED_FRAGS 1024
 
+static noinline_for_stack int
+sock_devmem_dontneed_manual_release(struct sock *sk, struct dmabuf_token *tokens,
+				    unsigned int num_tokens)
+{
+	struct net_iov *niov;
+	unsigned int i, j;
+	netmem_ref netmem;
+	unsigned int token;
+	int num_frags = 0;
+	int ret = 0;
+
+	if (!sk->sk_devmem_info.binding)
+		return -EINVAL;
+
+	for (i = 0; i < num_tokens; i++) {
+		for (j = 0; j < tokens[i].token_count; j++) {
+			token = tokens[i].token_start + j;
+			if (token >= sk->sk_devmem_info.binding->dmabuf->size / PAGE_SIZE)
+				break;
+
+			if (++num_frags > MAX_DONTNEED_FRAGS)
+				goto out;
+
+			niov = sk->sk_devmem_info.binding->vec[token];
+			if (atomic_dec_and_test(&niov->uref)) {
+				netmem = net_iov_to_netmem(niov);
+				WARN_ON_ONCE(!napi_pp_put_page(netmem));
+			}
+			ret++;
+		}
+	}
+out:
+	atomic_sub(ret, &sk->sk_devmem_info.outstanding_urefs);
+
+	return ret;
+}
+
 static noinline_for_stack int
 sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
 				 unsigned int num_tokens)
@@ -1089,32 +1128,32 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
 	int ret = 0, num_frags = 0;
 	netmem_ref netmems[16];
 
-	xa_lock_bh(&sk->sk_user_frags);
+	xa_lock_bh(&sk->sk_devmem_info.frags);
 	for (i = 0; i < num_tokens; i++) {
 		for (j = 0; j < tokens[i].token_count; j++) {
 			if (++num_frags > MAX_DONTNEED_FRAGS)
 				goto frag_limit_reached;
 
 			netmem_ref netmem = (__force netmem_ref)__xa_erase(
-				&sk->sk_user_frags, tokens[i].token_start + j);
+				&sk->sk_devmem_info.frags, tokens[i].token_start + j);
 
 			if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
 				continue;
 
 			netmems[netmem_num++] = netmem;
 			if (netmem_num == ARRAY_SIZE(netmems)) {
-				xa_unlock_bh(&sk->sk_user_frags);
+				xa_unlock_bh(&sk->sk_devmem_info.frags);
 				for (k = 0; k < netmem_num; k++)
 					WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
 				netmem_num = 0;
-				xa_lock_bh(&sk->sk_user_frags);
+				xa_lock_bh(&sk->sk_devmem_info.frags);
 			}
 			ret++;
 		}
 	}
 
 frag_limit_reached:
-	xa_unlock_bh(&sk->sk_user_frags);
+	xa_unlock_bh(&sk->sk_devmem_info.frags);
 	for (k = 0; k < netmem_num; k++)
 		WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
 
@@ -1145,7 +1184,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
 		return -EFAULT;
 	}
 
-	ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
+	if (sk->sk_devmem_info.autorelease)
+		ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
+	else
+		ret = sock_devmem_dontneed_manual_release(sk, tokens,
+							  num_tokens);
 
 	kvfree(tokens);
 	return ret;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a9345aa5a2e5..052875c1b547 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -260,6 +260,7 @@
 #include <linux/memblock.h>
 #include <linux/highmem.h>
 #include <linux/cache.h>
+#include <linux/dma-buf.h>
 #include <linux/err.h>
 #include <linux/time.h>
 #include <linux/slab.h>
@@ -492,7 +493,10 @@ void tcp_init_sock(struct sock *sk)
 
 	set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
 	sk_sockets_allocated_inc(sk);
-	xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1);
+	xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
+	sk->sk_devmem_info.binding = NULL;
+	atomic_set(&sk->sk_devmem_info.outstanding_urefs, 0);
+	sk->sk_devmem_info.autorelease = true;
 }
 EXPORT_IPV6_MOD(tcp_init_sock);
 
@@ -2424,11 +2428,11 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
 
 	/* Commit part that has been copied to user space. */
 	for (i = 0; i < p->idx; i++)
-		__xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
+		__xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i], XA_ZERO_ENTRY,
 			     (__force void *)p->netmems[i], GFP_KERNEL);
 	/* Rollback what has been pre-allocated and is no longer needed. */
 	for (; i < p->max; i++)
-		__xa_erase(&sk->sk_user_frags, p->tokens[i]);
+		__xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]);
 
 	p->max = 0;
 	p->idx = 0;
@@ -2436,14 +2440,17 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
 
 static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p)
 {
+	if (!sk->sk_devmem_info.autorelease)
+		return;
+
 	if (!p->max)
 		return;
 
-	xa_lock_bh(&sk->sk_user_frags);
+	xa_lock_bh(&sk->sk_devmem_info.frags);
 
 	tcp_xa_pool_commit_locked(sk, p);
 
-	xa_unlock_bh(&sk->sk_user_frags);
+	xa_unlock_bh(&sk->sk_devmem_info.frags);
 }
 
 static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
@@ -2454,18 +2461,18 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
 	if (p->idx < p->max)
 		return 0;
 
-	xa_lock_bh(&sk->sk_user_frags);
+	xa_lock_bh(&sk->sk_devmem_info.frags);
 
 	tcp_xa_pool_commit_locked(sk, p);
 
 	for (k = 0; k < max_frags; k++) {
-		err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
+		err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k],
 				 XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
 		if (err)
 			break;
 	}
 
-	xa_unlock_bh(&sk->sk_user_frags);
+	xa_unlock_bh(&sk->sk_devmem_info.frags);
 
 	p->max = k;
 	p->idx = 0;
@@ -2479,12 +2486,14 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 			      unsigned int offset, struct msghdr *msg,
 			      int remaining_len)
 {
+	struct net_devmem_dmabuf_binding *binding = NULL;
 	struct dmabuf_cmsg dmabuf_cmsg = { 0 };
 	struct tcp_xa_pool tcp_xa_pool;
 	unsigned int start;
 	int i, copy, n;
 	int sent = 0;
 	int err = 0;
+	int refs = 0;
 
 	tcp_xa_pool.max = 0;
 	tcp_xa_pool.idx = 0;
@@ -2536,6 +2545,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 			struct net_iov *niov;
 			u64 frag_offset;
+			u32 token;
 			int end;
 
 			/* !skb_frags_readable() should indicate that ALL the
@@ -2568,13 +2578,32 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 					      start;
 				dmabuf_cmsg.frag_offset = frag_offset;
 				dmabuf_cmsg.frag_size = copy;
-				err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
-							 skb_shinfo(skb)->nr_frags - i);
-				if (err)
+
+				binding = net_devmem_iov_binding(niov);
+
+				if (!sk->sk_devmem_info.binding)
+					sk->sk_devmem_info.binding = binding;
+
+				if (sk->sk_devmem_info.binding != binding) {
+					err = -EFAULT;
 					goto out;
+				}
+
+				if (sk->sk_devmem_info.autorelease) {
+					err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
+								 skb_shinfo(skb)->nr_frags - i);
+					if (err)
+						goto out;
+
+					dmabuf_cmsg.frag_token =
+						tcp_xa_pool.tokens[tcp_xa_pool.idx];
+				} else {
+					token = net_iov_virtual_addr(niov) >> PAGE_SHIFT;
+					dmabuf_cmsg.frag_token = token;
+				}
+
 
 				/* Will perform the exchange later */
-				dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
 				dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
 
 				offset += copy;
@@ -2587,8 +2616,15 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 				if (err)
 					goto out;
 
-				atomic_long_inc(&niov->pp_ref_count);
-				tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
+				if (sk->sk_devmem_info.autorelease) {
+					atomic_long_inc(&niov->pp_ref_count);
+					tcp_xa_pool.netmems[tcp_xa_pool.idx++] =
+						skb_frag_netmem(frag);
+				} else {
+					if (atomic_inc_return(&niov->uref) == 1)
+						atomic_long_inc(&niov->pp_ref_count);
+					refs++;
+				}
 
 				sent += copy;
 
@@ -2599,6 +2635,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 		}
 
 		tcp_xa_pool_commit(sk, &tcp_xa_pool);
+
 		if (!remaining_len)
 			goto out;
 
@@ -2617,9 +2654,13 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 
 out:
 	tcp_xa_pool_commit(sk, &tcp_xa_pool);
+
 	if (!sent)
 		sent = err;
 
+	if (refs > 0)
+		atomic_add(refs, &sk->sk_devmem_info.outstanding_urefs);
+
 	return sent;
 }
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 40a76da5364a..dbb7a71e3cce 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -89,6 +89,9 @@
 
 #include <crypto/md5.h>
 
+#include <linux/dma-buf.h>
+#include "../core/devmem.h"
+
 #include <trace/events/tcp.h>
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -2493,7 +2496,7 @@ static void tcp_release_user_frags(struct sock *sk)
 	unsigned long index;
 	void *netmem;
 
-	xa_for_each(&sk->sk_user_frags, index, netmem)
+	xa_for_each(&sk->sk_devmem_info.frags, index, netmem)
 		WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
 #endif
 }
@@ -2502,9 +2505,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	tcp_release_user_frags(sk);
+	if (sk->sk_devmem_info.binding && sk->sk_devmem_info.autorelease)
+		tcp_release_user_frags(sk);
 
-	xa_destroy(&sk->sk_user_frags);
+	xa_destroy(&sk->sk_devmem_info.frags);
+	sk->sk_devmem_info.binding = NULL;
 
 	trace_tcp_destroy_sock(sk);
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index ded2cf1f6006..a017dea35bb8 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -663,7 +663,10 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
 
-	xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
+	xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
+	newsk->sk_devmem_info.binding = NULL;
+	atomic_set(&newsk->sk_devmem_info.outstanding_urefs, 0);
+	newsk->sk_devmem_info.autorelease = true;
 
 	return newsk;
 }

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH net-next v6 4/6] net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control
  2025-11-05  1:23 [PATCH net-next v6 0/6] net: devmem: improve cpu cost of RX token management Bobby Eshleman
                   ` (2 preceding siblings ...)
  2025-11-05  1:23 ` [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management Bobby Eshleman
@ 2025-11-05  1:23 ` Bobby Eshleman
  2025-11-05 17:16   ` kernel test robot
  2025-11-05  1:23 ` [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option Bobby Eshleman
  2025-11-05  1:23 ` [PATCH net-next v6 6/6] net: devmem: add tests for " Bobby Eshleman
  5 siblings, 1 reply; 18+ messages in thread
From: Bobby Eshleman @ 2025-11-05  1:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
	Shuah Khan, Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
	Stanislav Fomichev, Bobby Eshleman

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add SO_DEVMEM_AUTORELEASE socket option to allow applications to
control token release behavior on a per-socket basis.

The socket option accepts boolean values (0 or 1):
- 1 (true): outstanding tokens are automatically released when the
  socket closes
- 0 (false): outstanding tokens are released when the dmabuf is unbound

The option can only be changed when the socket has no outstanding
tokens, enforced by checking:
1. The frags xarray is empty (no tokens in autorelease mode)
2. The outstanding_urefs counter is zero (no tokens in manual mode)

This restriction prevents inconsistent token tracking state between
acquisition and release calls. If either condition fails, setsockopt
returns -EBUSY.

The default state is autorelease off.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 include/uapi/asm-generic/socket.h       |  2 ++
 net/core/sock.c                         | 51 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c                          |  2 +-
 tools/include/uapi/asm-generic/socket.h |  2 ++
 4 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 53b5a8c002b1..59302318bb34 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -150,6 +150,8 @@
 #define SO_INQ			84
 #define SCM_INQ			SO_INQ
 
+#define SO_DEVMEM_AUTORELEASE	85
+
 #if !defined(__KERNEL__)
 
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
diff --git a/net/core/sock.c b/net/core/sock.c
index 465645c1d74f..27af476f3cd3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1160,6 +1160,46 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
 	return ret;
 }
 
+static noinline_for_stack int
+sock_devmem_set_autorelease(struct sock *sk, sockptr_t optval, unsigned int optlen)
+{
+	int val;
+
+	if (!sk_is_tcp(sk))
+		return -EBADF;
+
+	if (optlen < sizeof(int))
+		return -EINVAL;
+
+	if (copy_from_sockptr(&val, optval, sizeof(val)))
+		return -EFAULT;
+
+	/* Validate that val is 0 or 1 */
+	if (val != 0 && val != 1)
+		return -EINVAL;
+
+	sockopt_lock_sock(sk);
+
+	/* Can only change autorelease if:
+	 * 1. No tokens in the frags xarray (autorelease mode)
+	 * 2. No outstanding urefs (manual release mode)
+	 */
+	if (!xa_empty(&sk->sk_devmem_info.frags)) {
+		sockopt_release_sock(sk);
+		return -EBUSY;
+	}
+
+	if (atomic_read(&sk->sk_devmem_info.outstanding_urefs) > 0) {
+		sockopt_release_sock(sk);
+		return -EBUSY;
+	}
+
+	sk->sk_devmem_info.autorelease = !!val;
+
+	sockopt_release_sock(sk);
+	return 0;
+}
+
 static noinline_for_stack int
 sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
 {
@@ -1351,6 +1391,9 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
 #ifdef CONFIG_PAGE_POOL
 	case SO_DEVMEM_DONTNEED:
 		return sock_devmem_dontneed(sk, optval, optlen);
+
+	case SO_DEVMEM_AUTORELEASE:
+		return sock_devmem_set_autorelease(sk, optval, optlen);
 #endif
 	case SO_SNDTIMEO_OLD:
 	case SO_SNDTIMEO_NEW:
@@ -2208,6 +2251,14 @@ int sk_getsockopt(struct sock *sk, int level, int optname,
 		v.val = READ_ONCE(sk->sk_txrehash);
 		break;
 
+#ifdef CONFIG_PAGE_POOL
+	case SO_DEVMEM_AUTORELEASE:
+		if (!sk_is_tcp(sk))
+			return -EBADF;
+		v.val = sk->sk_devmem_info.autorelease;
+		break;
+#endif
+
 	default:
 		/* We implement the SO_SNDLOWAT etc to not be settable
 		 * (1003.1g 7).
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 052875c1b547..8226ba892b36 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -496,7 +496,7 @@ void tcp_init_sock(struct sock *sk)
 	xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
 	sk->sk_devmem_info.binding = NULL;
 	atomic_set(&sk->sk_devmem_info.outstanding_urefs, 0);
-	sk->sk_devmem_info.autorelease = true;
+	sk->sk_devmem_info.autorelease = false;
 }
 EXPORT_IPV6_MOD(tcp_init_sock);
 
diff --git a/tools/include/uapi/asm-generic/socket.h b/tools/include/uapi/asm-generic/socket.h
index f333a0ac4ee4..9710a3d7cc4d 100644
--- a/tools/include/uapi/asm-generic/socket.h
+++ b/tools/include/uapi/asm-generic/socket.h
@@ -147,6 +147,8 @@
 
 #define SO_PASSRIGHTS		83
 
+#define SO_DEVMEM_AUTORELEASE	85
+
 #if !defined(__KERNEL__)
 
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))

-- 
2.47.3



* [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option
  2025-11-05  1:23 [PATCH net-next v6 0/6] net: devmem: improve cpu cost of RX token management Bobby Eshleman
                   ` (3 preceding siblings ...)
  2025-11-05  1:23 ` [PATCH net-next v6 4/6] net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control Bobby Eshleman
@ 2025-11-05  1:23 ` Bobby Eshleman
  2025-11-05 17:34   ` Stanislav Fomichev
  2025-11-05  1:23 ` [PATCH net-next v6 6/6] net: devmem: add tests for " Bobby Eshleman
  5 siblings, 1 reply; 18+ messages in thread
From: Bobby Eshleman @ 2025-11-05  1:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
	Shuah Khan, Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
	Stanislav Fomichev, Bobby Eshleman

From: Bobby Eshleman <bobbyeshleman@meta.com>

Update devmem.rst documentation to describe the new SO_DEVMEM_AUTORELEASE
socket option and its usage.

Document the following:
- The two token release modes (automatic vs manual)
- How to use SO_DEVMEM_AUTORELEASE to control the behavior
- Performance benefits of disabling autorelease (~10% CPU reduction)
- Restrictions and caveats of manual token release
- Usage examples for both getsockopt and setsockopt

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 Documentation/networking/devmem.rst | 70 +++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
index a6cd7236bfbd..1bfce686dce6 100644
--- a/Documentation/networking/devmem.rst
+++ b/Documentation/networking/devmem.rst
@@ -215,8 +215,8 @@ Freeing frags
 -------------
 
 Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
-processes the frag. The user must return the frag to the kernel via
-SO_DEVMEM_DONTNEED::
+processes the frag. Users should return tokens to the kernel via
+SO_DEVMEM_DONTNEED when they are done processing the data::
 
 	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
 			 sizeof(token));
@@ -235,6 +235,72 @@ can be less than the tokens provided by the user in case of:
 (a) an internal kernel leak bug.
 (b) the user passed more than 1024 frags.
 
+
+Autorelease Control
+~~~~~~~~~~~~~~~~~~~
+
+The SO_DEVMEM_AUTORELEASE socket option controls what happens to outstanding
+tokens (tokens not released via SO_DEVMEM_DONTNEED) when the socket closes::
+
+	int autorelease = 0;  /* 0 = manual release, 1 = automatic release */
+	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE,
+			 &autorelease, sizeof(autorelease));
+
+	/* Query current setting */
+	int current_val;
+	socklen_t len = sizeof(current_val);
+	ret = getsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE,
+			 &current_val, &len);
+
+When autorelease is disabled (default):
+
+- Outstanding tokens are NOT released when the socket closes
+- Outstanding tokens are only released when the dmabuf is unbound
+- Provides better performance by eliminating xarray overhead (~10% CPU reduction)
+- Kernel tracks tokens via atomic reference counters in net_iov structures
+
+When autorelease is enabled:
+
+- Outstanding tokens are automatically released when the socket closes
+- Backwards compatible behavior
+- Kernel tracks tokens in an xarray per socket
+
+Important: In both modes, applications should call SO_DEVMEM_DONTNEED to
+return tokens as soon as they are done processing. The autorelease setting only
+affects what happens to tokens that are still outstanding when close() is called.
+
+The autorelease setting can only be changed when the socket has no outstanding
+tokens. If tokens are present, setsockopt returns -EBUSY.
+
+
+Performance Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Disabling autorelease provides an approximately 10% CPU utilization improvement in
+RX workloads by:
+
+- Eliminating xarray allocations and lookups for token tracking
+- Using atomic reference counters instead
+- Reducing lock contention on the xarray spinlock
+
+However, applications must ensure all tokens are released via
+SO_DEVMEM_DONTNEED before closing the socket, otherwise the backing pages will
+remain pinned until the dmabuf is unbound.
+
+
+Caveats
+~~~~~~~
+
+- With autorelease disabled, sockets cannot switch between different dmabuf
+  bindings. This restriction exists because tokens in this mode do not encode
+  the binding information necessary to perform the token release.
+
+- Applications using manual release mode (autorelease=0) must ensure all tokens
+  are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
+  leaks during the lifetime of the dmabuf binding. Tokens not released before
+  close() will only be freed when the dmabuf is unbound.
+
+
 TX Interface
 ============
 

-- 
2.47.3



* [PATCH net-next v6 6/6] net: devmem: add tests for SO_DEVMEM_AUTORELEASE socket option
  2025-11-05  1:23 [PATCH net-next v6 0/6] net: devmem: improve cpu cost of RX token management Bobby Eshleman
                   ` (4 preceding siblings ...)
  2025-11-05  1:23 ` [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option Bobby Eshleman
@ 2025-11-05  1:23 ` Bobby Eshleman
  5 siblings, 0 replies; 18+ messages in thread
From: Bobby Eshleman @ 2025-11-05  1:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
	Shuah Khan, Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-doc, linux-kselftest,
	Stanislav Fomichev, Bobby Eshleman

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add -A flag to ncdevmem to set autorelease mode.

Add tests for the SO_DEVMEM_AUTORELEASE socket option:

New tests include:
   - check_sockopt_autorelease_default: Verifies default value is 0
   - check_sockopt_autorelease_set_0: Tests setting to 0 and reading
     back
   - check_sockopt_autorelease_set_1: Tests toggling from 0 to 1
   - check_sockopt_autorelease_invalid: Tests invalid value (2) returns
     EINVAL
   - check_autorelease_disabled: Tests ncdevmem in manual token release
     mode
   - check_autorelease_enabled: Tests ncdevmem in autorelease mode

All check_sockopt tests gracefully skip with KsftSkipEx if
SO_DEVMEM_AUTORELEASE is not supported by the kernel.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 tools/testing/selftests/drivers/net/hw/devmem.py  | 115 +++++++++++++++++++++-
 tools/testing/selftests/drivers/net/hw/ncdevmem.c |  20 +++-
 2 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/devmem.py b/tools/testing/selftests/drivers/net/hw/devmem.py
index 45c2d49d55b6..29ec179d651f 100755
--- a/tools/testing/selftests/drivers/net/hw/devmem.py
+++ b/tools/testing/selftests/drivers/net/hw/devmem.py
@@ -1,6 +1,9 @@
 #!/usr/bin/env python3
 # SPDX-License-Identifier: GPL-2.0
 
+import socket
+import errno
+
 from os import path
 from lib.py import ksft_run, ksft_exit
 from lib.py import ksft_eq, KsftSkipEx
@@ -63,12 +66,122 @@ def check_tx_chunks(cfg) -> None:
     ksft_eq(socat.stdout.strip(), "hello\nworld")
 
 
+@ksft_disruptive
+def check_autorelease_disabled(cfg) -> None:
+    """Test RX with autorelease disabled (requires manual token release in ncdevmem)"""
+    require_devmem(cfg)
+
+    port = rand_port()
+    socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
+    listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} -c {cfg.remote_addr} -v 7 -A 0"
+
+    with bkg(listen_cmd, exit_wait=True) as ncdevmem:
+        wait_port_listen(port)
+        cmd(f"yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
+            head -c 1K | {socat}", host=cfg.remote, shell=True)
+
+    ksft_eq(ncdevmem.ret, 0)
+
+
+@ksft_disruptive
+def check_autorelease_enabled(cfg) -> None:
+    """Test RX with autorelease enabled (requires token autorelease in ncdevmem)"""
+    require_devmem(cfg)
+
+    port = rand_port()
+    socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
+    listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} -c {cfg.remote_addr} -v 7 -A 1"
+
+    with bkg(listen_cmd, exit_wait=True) as ncdevmem:
+        wait_port_listen(port)
+        cmd(f"yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
+            head -c 1K | {socat}", host=cfg.remote, shell=True)
+
+    ksft_eq(ncdevmem.ret, 0)
+
+
+def check_sockopt_autorelease_default(cfg) -> None:
+    """Test that SO_DEVMEM_AUTORELEASE default is 0"""
+    SO_DEVMEM_AUTORELEASE = 85
+
+    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    try:
+        val = sock.getsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE)
+        ksft_eq(val, 0, "Default autorelease should be 0")
+    except OSError as e:
+        if e.errno == errno.ENOPROTOOPT:
+            raise KsftSkipEx("SO_DEVMEM_AUTORELEASE not supported")
+        raise
+    finally:
+        sock.close()
+
+
+def check_sockopt_autorelease_set_0(cfg) -> None:
+    """Test setting SO_DEVMEM_AUTORELEASE to 0"""
+    SO_DEVMEM_AUTORELEASE = 85
+
+    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    try:
+        sock.setsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE, 0)
+        val = sock.getsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE)
+        ksft_eq(val, 0, "Autorelease should be 0 after setting")
+    except OSError as e:
+        if e.errno == errno.ENOPROTOOPT:
+            raise KsftSkipEx("SO_DEVMEM_AUTORELEASE not supported")
+        raise
+    finally:
+        sock.close()
+
+
+def check_sockopt_autorelease_set_1(cfg) -> None:
+    """Test setting SO_DEVMEM_AUTORELEASE to 1"""
+    SO_DEVMEM_AUTORELEASE = 85
+
+    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    try:
+        # First set to 0
+        sock.setsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE, 0)
+        # Then set back to 1
+        sock.setsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE, 1)
+        val = sock.getsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE)
+        ksft_eq(val, 1, "Autorelease should be 1 after setting")
+    except OSError as e:
+        if e.errno == errno.ENOPROTOOPT:
+            raise KsftSkipEx("SO_DEVMEM_AUTORELEASE not supported")
+        raise
+    finally:
+        sock.close()
+
+
+def check_sockopt_autorelease_invalid(cfg) -> None:
+    """Test that SO_DEVMEM_AUTORELEASE rejects invalid values"""
+    SO_DEVMEM_AUTORELEASE = 85
+
+    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    try:
+        try:
+            sock.setsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE, 2)
+            raise Exception("setsockopt should have failed with EINVAL")
+        except OSError as e:
+            if e.errno == errno.ENOPROTOOPT:
+                raise KsftSkipEx("SO_DEVMEM_AUTORELEASE not supported")
+            ksft_eq(e.errno, errno.EINVAL, "Should fail with EINVAL for invalid value")
+    finally:
+        sock.close()
+
+
 def main() -> None:
     with NetDrvEpEnv(__file__) as cfg:
         cfg.bin_local = path.abspath(path.dirname(__file__) + "/ncdevmem")
         cfg.bin_remote = cfg.remote.deploy(cfg.bin_local)
 
-        ksft_run([check_rx, check_tx, check_tx_chunks],
+        ksft_run([check_rx, check_tx, check_tx_chunks,
+                  check_autorelease_enabled,
+                  check_autorelease_disabled,
+                  check_sockopt_autorelease_default,
+                  check_sockopt_autorelease_set_0,
+                  check_sockopt_autorelease_set_1,
+                  check_sockopt_autorelease_invalid],
                  args=(cfg, ))
     ksft_exit()
 
diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 3288ed04ce08..34d608d07bec 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -83,6 +83,10 @@
 #define MSG_SOCK_DEVMEM 0x2000000
 #endif
 
+#ifndef SO_DEVMEM_AUTORELEASE
+#define SO_DEVMEM_AUTORELEASE 85
+#endif
+
 #define MAX_IOV 1024
 
 static size_t max_chunk;
@@ -97,6 +101,7 @@ static unsigned int ifindex;
 static unsigned int dmabuf_id;
 static uint32_t tx_dmabuf_id;
 static int waittime_ms = 500;
+static int autorelease = -1;
 
 /* System state loaded by current_config_load() */
 #define MAX_FLOWS	8
@@ -890,6 +895,16 @@ static int do_server(struct memory_buffer *mem)
 	if (enable_reuseaddr(socket_fd))
 		goto err_close_socket;
 
+	if (autorelease >= 0) {
+		ret = setsockopt(socket_fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE,
+				 &autorelease, sizeof(autorelease));
+		if (ret) {
+			pr_err("SO_DEVMEM_AUTORELEASE failed");
+			goto err_close_socket;
+		}
+		fprintf(stderr, "Set SO_DEVMEM_AUTORELEASE to %d\n", autorelease);
+	}
+
 	fprintf(stderr, "binding to address %s:%d\n", server_ip,
 		ntohs(server_sin.sin6_port));
 
@@ -1397,7 +1412,7 @@ int main(int argc, char *argv[])
 	int is_server = 0, opt;
 	int ret, err = 1;
 
-	while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:z:")) != -1) {
+	while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:z:A:")) != -1) {
 		switch (opt) {
 		case 'l':
 			is_server = 1;
@@ -1426,6 +1441,9 @@ int main(int argc, char *argv[])
 		case 'z':
 			max_chunk = atoi(optarg);
 			break;
+		case 'A':
+			autorelease = atoi(optarg);
+			break;
 		case '?':
 			fprintf(stderr, "unknown option: %c\n", optopt);
 			break;

-- 
2.47.3



* Re: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
  2025-11-05  1:23 ` [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management Bobby Eshleman
@ 2025-11-05 16:02   ` kernel test robot
  2025-11-05 20:55   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2025-11-05 16:02 UTC (permalink / raw)
  To: Bobby Eshleman, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, Mina Almasry
  Cc: llvm, oe-kbuild-all, netdev, linux-kernel, linux-arch, linux-doc,
	linux-kselftest, Stanislav Fomichev, Bobby Eshleman

Hi Bobby,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 255d75ef029f33f75fcf5015052b7302486f7ad2]

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-rename-tx_vec-to-vec-in-dmabuf-binding/20251105-092703
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-upstream-v6-3-ea98cf4d40b3%40meta.com
patch subject: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251105/202511052307.doTV8fDn-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251105/202511052307.doTV8fDn-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511052307.doTV8fDn-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> net/core/sock.c:1107:12: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
    1107 |                                 return ret;
         |                                        ^~~
   net/core/sock.c:1095:9: note: initialize the variable 'ret' to silence this warning
    1095 |         int ret;
         |                ^
         |                 = 0
   1 warning generated.
--
>> net/ipv4/tcp.c:2626:6: warning: variable 'refs' is uninitialized when used here [-Wuninitialized]
    2626 |                                         refs++;
         |                                         ^~~~
   net/ipv4/tcp.c:2496:10: note: initialize the variable 'refs' to silence this warning
    2496 |         int refs;
         |                 ^
         |                  = 0
   1 warning generated.


vim +/ret +1107 net/core/sock.c

  1085	
  1086	static noinline_for_stack int
  1087	sock_devmem_dontneed_manual_release(struct sock *sk, struct dmabuf_token *tokens,
  1088					    unsigned int num_tokens)
  1089	{
  1090		struct net_iov *niov;
  1091		unsigned int i, j;
  1092		netmem_ref netmem;
  1093		unsigned int token;
  1094		int num_frags = 0;
  1095		int ret;
  1096	
  1097		if (!sk->sk_devmem_info.binding)
  1098			return -EINVAL;
  1099	
  1100		for (i = 0; i < num_tokens; i++) {
  1101			for (j = 0; j < tokens[i].token_count; j++) {
  1102				token = tokens[i].token_start + j;
  1103				if (token >= sk->sk_devmem_info.binding->dmabuf->size / PAGE_SIZE)
  1104					break;
  1105	
  1106				if (++num_frags > MAX_DONTNEED_FRAGS)
> 1107					return ret;
  1108	
  1109				niov = sk->sk_devmem_info.binding->vec[token];
  1110				if (atomic_dec_and_test(&niov->uref)) {
  1111					netmem = net_iov_to_netmem(niov);
  1112					WARN_ON_ONCE(!napi_pp_put_page(netmem));
  1113				}
  1114				ret++;
  1115			}
  1116		}
  1117	
  1118		atomic_sub(ret, &sk->sk_devmem_info.outstanding_urefs);
  1119	
  1120		return ret;
  1121	}
  1122	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net-next v6 4/6] net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control
  2025-11-05  1:23 ` [PATCH net-next v6 4/6] net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control Bobby Eshleman
@ 2025-11-05 17:16   ` kernel test robot
  0 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2025-11-05 17:16 UTC (permalink / raw)
  To: Bobby Eshleman, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, Mina Almasry
  Cc: oe-kbuild-all, netdev, linux-kernel, linux-arch, linux-doc,
	linux-kselftest, Stanislav Fomichev, Bobby Eshleman

Hi Bobby,

kernel test robot noticed the following build errors:

[auto build test ERROR on 255d75ef029f33f75fcf5015052b7302486f7ad2]

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-rename-tx_vec-to-vec-in-dmabuf-binding/20251105-092703
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-upstream-v6-4-ea98cf4d40b3%40meta.com
patch subject: [PATCH net-next v6 4/6] net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control
config: sparc64-randconfig-002-20251105 (https://download.01.org/0day-ci/archive/20251106/202511060041.0DVnS1CV-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251106/202511060041.0DVnS1CV-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511060041.0DVnS1CV-lkp@intel.com/

All errors (new ones prefixed by >>):

   net/core/sock.c: In function 'sk_setsockopt':
>> net/core/sock.c:1395:7: error: 'SO_DEVMEM_AUTORELEASE' undeclared (first use in this function)
    1395 |  case SO_DEVMEM_AUTORELEASE:
         |       ^~~~~~~~~~~~~~~~~~~~~
   net/core/sock.c:1395:7: note: each undeclared identifier is reported only once for each function it appears in
   net/core/sock.c: In function 'sk_getsockopt':
   net/core/sock.c:2255:7: error: 'SO_DEVMEM_AUTORELEASE' undeclared (first use in this function)
    2255 |  case SO_DEVMEM_AUTORELEASE:
         |       ^~~~~~~~~~~~~~~~~~~~~


vim +/SO_DEVMEM_AUTORELEASE +1395 net/core/sock.c

  1282	
  1283	/*
  1284	 *	This is meant for all protocols to use and covers goings on
  1285	 *	at the socket level. Everything here is generic.
  1286	 */
  1287	
  1288	int sk_setsockopt(struct sock *sk, int level, int optname,
  1289			  sockptr_t optval, unsigned int optlen)
  1290	{
  1291		struct so_timestamping timestamping;
  1292		struct socket *sock = sk->sk_socket;
  1293		struct sock_txtime sk_txtime;
  1294		int val;
  1295		int valbool;
  1296		struct linger ling;
  1297		int ret = 0;
  1298	
  1299		/*
  1300		 *	Options without arguments
  1301		 */
  1302	
  1303		if (optname == SO_BINDTODEVICE)
  1304			return sock_setbindtodevice(sk, optval, optlen);
  1305	
  1306		if (optlen < sizeof(int))
  1307			return -EINVAL;
  1308	
  1309		if (copy_from_sockptr(&val, optval, sizeof(val)))
  1310			return -EFAULT;
  1311	
  1312		valbool = val ? 1 : 0;
  1313	
  1314		/* handle options which do not require locking the socket. */
  1315		switch (optname) {
  1316		case SO_PRIORITY:
  1317			if (sk_set_prio_allowed(sk, val)) {
  1318				sock_set_priority(sk, val);
  1319				return 0;
  1320			}
  1321			return -EPERM;
  1322		case SO_TYPE:
  1323		case SO_PROTOCOL:
  1324		case SO_DOMAIN:
  1325		case SO_ERROR:
  1326			return -ENOPROTOOPT;
  1327	#ifdef CONFIG_NET_RX_BUSY_POLL
  1328		case SO_BUSY_POLL:
  1329			if (val < 0)
  1330				return -EINVAL;
  1331			WRITE_ONCE(sk->sk_ll_usec, val);
  1332			return 0;
  1333		case SO_PREFER_BUSY_POLL:
  1334			if (valbool && !sockopt_capable(CAP_NET_ADMIN))
  1335				return -EPERM;
  1336			WRITE_ONCE(sk->sk_prefer_busy_poll, valbool);
  1337			return 0;
  1338		case SO_BUSY_POLL_BUDGET:
  1339			if (val > READ_ONCE(sk->sk_busy_poll_budget) &&
  1340			    !sockopt_capable(CAP_NET_ADMIN))
  1341				return -EPERM;
  1342			if (val < 0 || val > U16_MAX)
  1343				return -EINVAL;
  1344			WRITE_ONCE(sk->sk_busy_poll_budget, val);
  1345			return 0;
  1346	#endif
  1347		case SO_MAX_PACING_RATE:
  1348			{
  1349			unsigned long ulval = (val == ~0U) ? ~0UL : (unsigned int)val;
  1350			unsigned long pacing_rate;
  1351	
  1352			if (sizeof(ulval) != sizeof(val) &&
  1353			    optlen >= sizeof(ulval) &&
  1354			    copy_from_sockptr(&ulval, optval, sizeof(ulval))) {
  1355				return -EFAULT;
  1356			}
  1357			if (ulval != ~0UL)
  1358				cmpxchg(&sk->sk_pacing_status,
  1359					SK_PACING_NONE,
  1360					SK_PACING_NEEDED);
  1361			/* Pairs with READ_ONCE() from sk_getsockopt() */
  1362			WRITE_ONCE(sk->sk_max_pacing_rate, ulval);
  1363			pacing_rate = READ_ONCE(sk->sk_pacing_rate);
  1364			if (ulval < pacing_rate)
  1365				WRITE_ONCE(sk->sk_pacing_rate, ulval);
  1366			return 0;
  1367			}
  1368		case SO_TXREHASH:
  1369			if (!sk_is_tcp(sk))
  1370				return -EOPNOTSUPP;
  1371			if (val < -1 || val > 1)
  1372				return -EINVAL;
  1373			if ((u8)val == SOCK_TXREHASH_DEFAULT)
  1374				val = READ_ONCE(sock_net(sk)->core.sysctl_txrehash);
  1375			/* Paired with READ_ONCE() in tcp_rtx_synack()
  1376			 * and sk_getsockopt().
  1377			 */
  1378			WRITE_ONCE(sk->sk_txrehash, (u8)val);
  1379			return 0;
  1380		case SO_PEEK_OFF:
  1381			{
  1382			int (*set_peek_off)(struct sock *sk, int val);
  1383	
  1384			set_peek_off = READ_ONCE(sock->ops)->set_peek_off;
  1385			if (set_peek_off)
  1386				ret = set_peek_off(sk, val);
  1387			else
  1388				ret = -EOPNOTSUPP;
  1389			return ret;
  1390			}
  1391	#ifdef CONFIG_PAGE_POOL
  1392		case SO_DEVMEM_DONTNEED:
  1393			return sock_devmem_dontneed(sk, optval, optlen);
  1394	
> 1395		case SO_DEVMEM_AUTORELEASE:
  1396			return sock_devmem_set_autorelease(sk, optval, optlen);
  1397	#endif
  1398		case SO_SNDTIMEO_OLD:
  1399		case SO_SNDTIMEO_NEW:
  1400			return sock_set_timeout(&sk->sk_sndtimeo, optval,
  1401						optlen, optname == SO_SNDTIMEO_OLD);
  1402		case SO_RCVTIMEO_OLD:
  1403		case SO_RCVTIMEO_NEW:
  1404			return sock_set_timeout(&sk->sk_rcvtimeo, optval,
  1405						optlen, optname == SO_RCVTIMEO_OLD);
  1406		}
  1407	
  1408		sockopt_lock_sock(sk);
  1409	
  1410		switch (optname) {
  1411		case SO_DEBUG:
  1412			if (val && !sockopt_capable(CAP_NET_ADMIN))
  1413				ret = -EACCES;
  1414			else
  1415				sock_valbool_flag(sk, SOCK_DBG, valbool);
  1416			break;
  1417		case SO_REUSEADDR:
  1418			sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
  1419			break;
  1420		case SO_REUSEPORT:
  1421			if (valbool && !sk_is_inet(sk))
  1422				ret = -EOPNOTSUPP;
  1423			else
  1424				sk->sk_reuseport = valbool;
  1425			break;
  1426		case SO_DONTROUTE:
  1427			sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
  1428			sk_dst_reset(sk);
  1429			break;
  1430		case SO_BROADCAST:
  1431			sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
  1432			break;
  1433		case SO_SNDBUF:
  1434			/* Don't error on this BSD doesn't and if you think
  1435			 * about it this is right. Otherwise apps have to
  1436			 * play 'guess the biggest size' games. RCVBUF/SNDBUF
  1437			 * are treated in BSD as hints
  1438			 */
  1439			val = min_t(u32, val, READ_ONCE(sysctl_wmem_max));
  1440	set_sndbuf:
  1441			/* Ensure val * 2 fits into an int, to prevent max_t()
  1442			 * from treating it as a negative value.
  1443			 */
  1444			val = min_t(int, val, INT_MAX / 2);
  1445			sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
  1446			WRITE_ONCE(sk->sk_sndbuf,
  1447				   max_t(int, val * 2, SOCK_MIN_SNDBUF));
  1448			/* Wake up sending tasks if we upped the value. */
  1449			sk->sk_write_space(sk);
  1450			break;
  1451	
  1452		case SO_SNDBUFFORCE:
  1453			if (!sockopt_capable(CAP_NET_ADMIN)) {
  1454				ret = -EPERM;
  1455				break;
  1456			}
  1457	
  1458			/* No negative values (to prevent underflow, as val will be
  1459			 * multiplied by 2).
  1460			 */
  1461			if (val < 0)
  1462				val = 0;
  1463			goto set_sndbuf;
  1464	
  1465		case SO_RCVBUF:
  1466			/* Don't error on this BSD doesn't and if you think
  1467			 * about it this is right. Otherwise apps have to
  1468			 * play 'guess the biggest size' games. RCVBUF/SNDBUF
  1469			 * are treated in BSD as hints
  1470			 */
  1471			__sock_set_rcvbuf(sk, min_t(u32, val, READ_ONCE(sysctl_rmem_max)));
  1472			break;
  1473	
  1474		case SO_RCVBUFFORCE:
  1475			if (!sockopt_capable(CAP_NET_ADMIN)) {
  1476				ret = -EPERM;
  1477				break;
  1478			}
  1479	
  1480			/* No negative values (to prevent underflow, as val will be
  1481			 * multiplied by 2).
  1482			 */
  1483			__sock_set_rcvbuf(sk, max(val, 0));
  1484			break;
  1485	
  1486		case SO_KEEPALIVE:
  1487			if (sk->sk_prot->keepalive)
  1488				sk->sk_prot->keepalive(sk, valbool);
  1489			sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
  1490			break;
  1491	
  1492		case SO_OOBINLINE:
  1493			sock_valbool_flag(sk, SOCK_URGINLINE, valbool);
  1494			break;
  1495	
  1496		case SO_NO_CHECK:
  1497			sk->sk_no_check_tx = valbool;
  1498			break;
  1499	
  1500		case SO_LINGER:
  1501			if (optlen < sizeof(ling)) {
  1502				ret = -EINVAL;	/* 1003.1g */
  1503				break;
  1504			}
  1505			if (copy_from_sockptr(&ling, optval, sizeof(ling))) {
  1506				ret = -EFAULT;
  1507				break;
  1508			}
  1509			if (!ling.l_onoff) {
  1510				sock_reset_flag(sk, SOCK_LINGER);
  1511			} else {
  1512				unsigned long t_sec = ling.l_linger;
  1513	
  1514				if (t_sec >= MAX_SCHEDULE_TIMEOUT / HZ)
  1515					WRITE_ONCE(sk->sk_lingertime, MAX_SCHEDULE_TIMEOUT);
  1516				else
  1517					WRITE_ONCE(sk->sk_lingertime, t_sec * HZ);
  1518				sock_set_flag(sk, SOCK_LINGER);
  1519			}
  1520			break;
  1521	
  1522		case SO_BSDCOMPAT:
  1523			break;
  1524	
  1525		case SO_TIMESTAMP_OLD:
  1526		case SO_TIMESTAMP_NEW:
  1527		case SO_TIMESTAMPNS_OLD:
  1528		case SO_TIMESTAMPNS_NEW:
  1529			sock_set_timestamp(sk, optname, valbool);
  1530			break;
  1531	
  1532		case SO_TIMESTAMPING_NEW:
  1533		case SO_TIMESTAMPING_OLD:
  1534			if (optlen == sizeof(timestamping)) {
  1535				if (copy_from_sockptr(&timestamping, optval,
  1536						      sizeof(timestamping))) {
  1537					ret = -EFAULT;
  1538					break;
  1539				}
  1540			} else {
  1541				memset(&timestamping, 0, sizeof(timestamping));
  1542				timestamping.flags = val;
  1543			}
  1544			ret = sock_set_timestamping(sk, optname, timestamping);
  1545			break;
  1546	
  1547		case SO_RCVLOWAT:
  1548			{
  1549			int (*set_rcvlowat)(struct sock *sk, int val) = NULL;
  1550	
  1551			if (val < 0)
  1552				val = INT_MAX;
  1553			if (sock)
  1554				set_rcvlowat = READ_ONCE(sock->ops)->set_rcvlowat;
  1555			if (set_rcvlowat)
  1556				ret = set_rcvlowat(sk, val);
  1557			else
  1558				WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
  1559			break;
  1560			}
  1561		case SO_ATTACH_FILTER: {
  1562			struct sock_fprog fprog;
  1563	
  1564			ret = copy_bpf_fprog_from_user(&fprog, optval, optlen);
  1565			if (!ret)
  1566				ret = sk_attach_filter(&fprog, sk);
  1567			break;
  1568		}
  1569		case SO_ATTACH_BPF:
  1570			ret = -EINVAL;
  1571			if (optlen == sizeof(u32)) {
  1572				u32 ufd;
  1573	
  1574				ret = -EFAULT;
  1575				if (copy_from_sockptr(&ufd, optval, sizeof(ufd)))
  1576					break;
  1577	
  1578				ret = sk_attach_bpf(ufd, sk);
  1579			}
  1580			break;
  1581	
  1582		case SO_ATTACH_REUSEPORT_CBPF: {
  1583			struct sock_fprog fprog;
  1584	
  1585			ret = copy_bpf_fprog_from_user(&fprog, optval, optlen);
  1586			if (!ret)
  1587				ret = sk_reuseport_attach_filter(&fprog, sk);
  1588			break;
  1589		}
  1590		case SO_ATTACH_REUSEPORT_EBPF:
  1591			ret = -EINVAL;
  1592			if (optlen == sizeof(u32)) {
  1593				u32 ufd;
  1594	
  1595				ret = -EFAULT;
  1596				if (copy_from_sockptr(&ufd, optval, sizeof(ufd)))
  1597					break;
  1598	
  1599				ret = sk_reuseport_attach_bpf(ufd, sk);
  1600			}
  1601			break;
  1602	
  1603		case SO_DETACH_REUSEPORT_BPF:
  1604			ret = reuseport_detach_prog(sk);
  1605			break;
  1606	
  1607		case SO_DETACH_FILTER:
  1608			ret = sk_detach_filter(sk);
  1609			break;
  1610	
  1611		case SO_LOCK_FILTER:
  1612			if (sock_flag(sk, SOCK_FILTER_LOCKED) && !valbool)
  1613				ret = -EPERM;
  1614			else
  1615				sock_valbool_flag(sk, SOCK_FILTER_LOCKED, valbool);
  1616			break;
  1617	
  1618		case SO_MARK:
  1619			if (!sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
  1620			    !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
  1621				ret = -EPERM;
  1622				break;
  1623			}
  1624	
  1625			__sock_set_mark(sk, val);
  1626			break;
  1627		case SO_RCVMARK:
  1628			sock_valbool_flag(sk, SOCK_RCVMARK, valbool);
  1629			break;
  1630	
  1631		case SO_RCVPRIORITY:
  1632			sock_valbool_flag(sk, SOCK_RCVPRIORITY, valbool);
  1633			break;
  1634	
  1635		case SO_RXQ_OVFL:
  1636			sock_valbool_flag(sk, SOCK_RXQ_OVFL, valbool);
  1637			break;
  1638	
  1639		case SO_WIFI_STATUS:
  1640			sock_valbool_flag(sk, SOCK_WIFI_STATUS, valbool);
  1641			break;
  1642	
  1643		case SO_NOFCS:
  1644			sock_valbool_flag(sk, SOCK_NOFCS, valbool);
  1645			break;
  1646	
  1647		case SO_SELECT_ERR_QUEUE:
  1648			sock_valbool_flag(sk, SOCK_SELECT_ERR_QUEUE, valbool);
  1649			break;
  1650	
  1651		case SO_PASSCRED:
  1652			if (sk_may_scm_recv(sk))
  1653				sk->sk_scm_credentials = valbool;
  1654			else
  1655				ret = -EOPNOTSUPP;
  1656			break;
  1657	
  1658		case SO_PASSSEC:
  1659			if (IS_ENABLED(CONFIG_SECURITY_NETWORK) && sk_may_scm_recv(sk))
  1660				sk->sk_scm_security = valbool;
  1661			else
  1662				ret = -EOPNOTSUPP;
  1663			break;
  1664	
  1665		case SO_PASSPIDFD:
  1666			if (sk_is_unix(sk))
  1667				sk->sk_scm_pidfd = valbool;
  1668			else
  1669				ret = -EOPNOTSUPP;
  1670			break;
  1671	
  1672		case SO_PASSRIGHTS:
  1673			if (sk_is_unix(sk))
  1674				sk->sk_scm_rights = valbool;
  1675			else
  1676				ret = -EOPNOTSUPP;
  1677			break;
  1678	
  1679		case SO_INCOMING_CPU:
  1680			reuseport_update_incoming_cpu(sk, val);
  1681			break;
  1682	
  1683		case SO_CNX_ADVICE:
  1684			if (val == 1)
  1685				dst_negative_advice(sk);
  1686			break;
  1687	
  1688		case SO_ZEROCOPY:
  1689			if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
  1690				if (!(sk_is_tcp(sk) ||
  1691				      (sk->sk_type == SOCK_DGRAM &&
  1692				       sk->sk_protocol == IPPROTO_UDP)))
  1693					ret = -EOPNOTSUPP;
  1694			} else if (sk->sk_family != PF_RDS) {
  1695				ret = -EOPNOTSUPP;
  1696			}
  1697			if (!ret) {
  1698				if (val < 0 || val > 1)
  1699					ret = -EINVAL;
  1700				else
  1701					sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
  1702			}
  1703			break;
  1704	
  1705		case SO_TXTIME:
  1706			if (optlen != sizeof(struct sock_txtime)) {
  1707				ret = -EINVAL;
  1708				break;
  1709			} else if (copy_from_sockptr(&sk_txtime, optval,
  1710				   sizeof(struct sock_txtime))) {
  1711				ret = -EFAULT;
  1712				break;
  1713			} else if (sk_txtime.flags & ~SOF_TXTIME_FLAGS_MASK) {
  1714				ret = -EINVAL;
  1715				break;
  1716			}
  1717			/* CLOCK_MONOTONIC is only used by sch_fq, and this packet
  1718			 * scheduler has enough safe guards.
  1719			 */
  1720			if (sk_txtime.clockid != CLOCK_MONOTONIC &&
  1721			    !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
  1722				ret = -EPERM;
  1723				break;
  1724			}
  1725	
  1726			ret = sockopt_validate_clockid(sk_txtime.clockid);
  1727			if (ret)
  1728				break;
  1729	
  1730			sock_valbool_flag(sk, SOCK_TXTIME, true);
  1731			sk->sk_clockid = sk_txtime.clockid;
  1732			sk->sk_txtime_deadline_mode =
  1733				!!(sk_txtime.flags & SOF_TXTIME_DEADLINE_MODE);
  1734			sk->sk_txtime_report_errors =
  1735				!!(sk_txtime.flags & SOF_TXTIME_REPORT_ERRORS);
  1736			break;
  1737	
  1738		case SO_BINDTOIFINDEX:
  1739			ret = sock_bindtoindex_locked(sk, val);
  1740			break;
  1741	
  1742		case SO_BUF_LOCK:
  1743			if (val & ~SOCK_BUF_LOCK_MASK) {
  1744				ret = -EINVAL;
  1745				break;
  1746			}
  1747			sk->sk_userlocks = val | (sk->sk_userlocks &
  1748						  ~SOCK_BUF_LOCK_MASK);
  1749			break;
  1750	
  1751		case SO_RESERVE_MEM:
  1752		{
  1753			int delta;
  1754	
  1755			if (val < 0) {
  1756				ret = -EINVAL;
  1757				break;
  1758			}
  1759	
  1760			delta = val - sk->sk_reserved_mem;
  1761			if (delta < 0)
  1762				sock_release_reserved_memory(sk, -delta);
  1763			else
  1764				ret = sock_reserve_memory(sk, delta);
  1765			break;
  1766		}
  1767	
  1768		default:
  1769			ret = -ENOPROTOOPT;
  1770			break;
  1771		}
  1772		sockopt_release_sock(sk);
  1773		return ret;
  1774	}
  1775	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option
  2025-11-05  1:23 ` [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option Bobby Eshleman
@ 2025-11-05 17:34   ` Stanislav Fomichev
  2025-11-05 17:44     ` Mina Almasry
  2025-11-05 17:59     ` Bobby Eshleman
  0 siblings, 2 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2025-11-05 17:34 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
	Shuah Khan, Mina Almasry, netdev, linux-kernel, linux-arch,
	linux-doc, linux-kselftest, Stanislav Fomichev, Bobby Eshleman

On 11/04, Bobby Eshleman wrote:
> From: Bobby Eshleman <bobbyeshleman@meta.com>
> 

[..]

> +Autorelease Control
> +~~~~~~~~~~~~~~~~~~~

Have you considered an option to have this flag on the dmabuf binding
itself? This will let us keep everything in ynl and not add a new socket
option. I think also semantically, this is a property of the binding
and not the socket? (not sure what's gonna happen if we have
autorelease=on and autorelease=off sockets receiving to the same
dmabuf)


* Re: [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option
  2025-11-05 17:34   ` Stanislav Fomichev
@ 2025-11-05 17:44     ` Mina Almasry
  2025-11-05 19:31       ` Stanislav Fomichev
  2025-11-05 17:59     ` Bobby Eshleman
  1 sibling, 1 reply; 18+ messages in thread
From: Mina Almasry @ 2025-11-05 17:44 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Bobby Eshleman, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, netdev, linux-kernel, linux-arch,
	linux-doc, linux-kselftest, Stanislav Fomichev, Bobby Eshleman

On Wed, Nov 5, 2025 at 9:34 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 11/04, Bobby Eshleman wrote:
> > From: Bobby Eshleman <bobbyeshleman@meta.com>
> >
>
> [..]
>
> > +Autorelease Control
> > +~~~~~~~~~~~~~~~~~~~
>
> Have you considered an option to have this flag on the dmabuf binding
> itself? This will let us keep everything in ynl and not add a new socket
> option. I think also semantically, this is a property of the binding
> and not the socket? (not sure what's gonna happen if we have
> autorelease=on and autorelease=off sockets receiving to the same
> dmabuf)

I think this thread (and maybe other comments on that patch) is the
context that missed your inbox:

https://lore.kernel.org/netdev/aQIoxVO3oICd8U8Q@devvm11784.nha0.facebook.com/

Let us know if you disagree.

-- 
Thanks,
Mina


* Re: [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option
  2025-11-05 17:34   ` Stanislav Fomichev
  2025-11-05 17:44     ` Mina Almasry
@ 2025-11-05 17:59     ` Bobby Eshleman
  1 sibling, 0 replies; 18+ messages in thread
From: Bobby Eshleman @ 2025-11-05 17:59 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
	David Ahern, Arnd Bergmann, Jonathan Corbet, Andrew Lunn,
	Shuah Khan, Mina Almasry, netdev, linux-kernel, linux-arch,
	linux-doc, linux-kselftest, Stanislav Fomichev, Bobby Eshleman

On Wed, Nov 05, 2025 at 09:34:03AM -0800, Stanislav Fomichev wrote:
> On 11/04, Bobby Eshleman wrote:
> > From: Bobby Eshleman <bobbyeshleman@meta.com>
> > 
> 
> [..]
> 
> > +Autorelease Control
> > +~~~~~~~~~~~~~~~~~~~
> 
> Have you considered an option to have this flag on the dmabuf binding
> itself? This will let us keep everything in ynl and not add a new socket
> option. I think also semantically, this is a property of the binding
> and not the socket? (not sure what's gonna happen if we have
> autorelease=on and autorelease=off sockets receiving to the same
> dmabuf)

This was our initial instinct too, and it was the implementation in the
prior version, but we opted for a socket-level property because it
simplifies backwards compatibility with multi-binding steering rules. In
that case, where bindings may have different autorelease settings, the
recv path would need to error out as soon as a binding with a different
autorelease value was detected, because the dontneed path has no context
to know whether a given token belongs to the socket's xarray
(autorelease=on) or to the binding->vec (autorelease=off).

At the socket level we can simply prevent the mode switch by counting
outstanding references... to do this at the binding level, I think we
would have to revert to the ethtool approach we experimented with
earlier (resolving steering rules to queues, then checking that their
binding->autorelease values are consistent).

This should work out of the box for mixed modes, given the
outstanding-ref rule.

We should probably add a test specifically for that...
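For illustration, a userspace sketch of the guard described above (a
simplified model, not the kernel implementation; the field names mirror
the quoted series, everything else here is an assumption):

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-in for the per-socket devmem state in the series. */
struct devmem_info {
	atomic_int outstanding_urefs;	/* tokens handed to userspace */
	bool autorelease;
};

/* Refuse the mode switch while any token references are outstanding,
 * so a socket can never mix xarray-tracked and uref-tracked tokens.
 */
static int set_autorelease(struct devmem_info *info, bool on)
{
	if (atomic_load(&info->outstanding_urefs) > 0)
		return -EBUSY;
	info->autorelease = on;
	return 0;
}
```

Once every outstanding token has been returned via dontneed and the
counter drops back to zero, the switch is allowed again.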

Best,
Bobby


* Re: [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option
  2025-11-05 17:44     ` Mina Almasry
@ 2025-11-05 19:31       ` Stanislav Fomichev
  2025-11-05 23:17         ` Stanislav Fomichev
  0 siblings, 1 reply; 18+ messages in thread
From: Stanislav Fomichev @ 2025-11-05 19:31 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Bobby Eshleman, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, netdev, linux-kernel, linux-arch,
	linux-doc, linux-kselftest, Stanislav Fomichev, Bobby Eshleman

On 11/05, Mina Almasry wrote:
> On Wed, Nov 5, 2025 at 9:34 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 11/04, Bobby Eshleman wrote:
> > > From: Bobby Eshleman <bobbyeshleman@meta.com>
> > >
> >
> > [..]
> >
> > > +Autorelease Control
> > > +~~~~~~~~~~~~~~~~~~~
> >
> > Have you considered an option to have this flag on the dmabuf binding
> > itself? This will let us keep everything in ynl and not add a new socket
> > option. I think also semantically, this is a property of the binding
> > and not the socket? (not sure what's gonna happen if we have
> > autorelease=on and autorelease=off sockets receiving to the same
> > dmabuf)
> 
> I think this thread (and maybe other comments on that patch) is the
> context that missed your inbox:
> 
> https://lore.kernel.org/netdev/aQIoxVO3oICd8U8Q@devvm11784.nha0.facebook.com/
> 
> Let us know if you disagree.

Thanks, I did miss that whole v5 because I was OOO, let me take a look!


* Re: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
  2025-11-05  1:23 ` [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management Bobby Eshleman
  2025-11-05 16:02   ` kernel test robot
@ 2025-11-05 20:55   ` kernel test robot
  2025-11-06 15:11   ` Dan Carpenter
  2025-11-06 15:14   ` Dan Carpenter
  3 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2025-11-05 20:55 UTC (permalink / raw)
  To: Bobby Eshleman, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, Mina Almasry
  Cc: oe-kbuild-all, netdev, linux-kernel, linux-arch, linux-doc,
	linux-kselftest, Stanislav Fomichev, Bobby Eshleman

Hi Bobby,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 255d75ef029f33f75fcf5015052b7302486f7ad2]

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-rename-tx_vec-to-vec-in-dmabuf-binding/20251105-092703
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-upstream-v6-3-ea98cf4d40b3%40meta.com
patch subject: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
config: arc-nsimosci_hs_defconfig (https://download.01.org/0day-ci/archive/20251106/202511060345.AQs0FTNg-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251106/202511060345.AQs0FTNg-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511060345.AQs0FTNg-lkp@intel.com/

All warnings (new ones prefixed by >>):

   net/ipv4/tcp.c: In function 'tcp_recvmsg_dmabuf':
>> net/ipv4/tcp.c:2661:12: warning: 'refs' is used uninitialized [-Wuninitialized]
    2661 |         if (refs > 0)
         |            ^
   net/ipv4/tcp.c:2496:13: note: 'refs' was declared here
    2496 |         int refs;
         |             ^~~~


vim +/refs +2661 net/ipv4/tcp.c

  2481	
  2482	/* On error, returns the -errno. On success, returns number of bytes sent to the
  2483	 * user. May not consume all of @remaining_len.
  2484	 */
  2485	static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
  2486				      unsigned int offset, struct msghdr *msg,
  2487				      int remaining_len)
  2488	{
  2489		struct net_devmem_dmabuf_binding *binding = NULL;
  2490		struct dmabuf_cmsg dmabuf_cmsg = { 0 };
  2491		struct tcp_xa_pool tcp_xa_pool;
  2492		unsigned int start;
  2493		int i, copy, n;
  2494		int sent = 0;
  2495		int err = 0;
  2496		int refs;
  2497	
  2498		tcp_xa_pool.max = 0;
  2499		tcp_xa_pool.idx = 0;
  2500		do {
  2501			start = skb_headlen(skb);
  2502	
  2503			if (skb_frags_readable(skb)) {
  2504				err = -ENODEV;
  2505				goto out;
  2506			}
  2507	
  2508			/* Copy header. */
  2509			copy = start - offset;
  2510			if (copy > 0) {
  2511				copy = min(copy, remaining_len);
  2512	
  2513				n = copy_to_iter(skb->data + offset, copy,
  2514						 &msg->msg_iter);
  2515				if (n != copy) {
  2516					err = -EFAULT;
  2517					goto out;
  2518				}
  2519	
  2520				offset += copy;
  2521				remaining_len -= copy;
  2522	
  2523				/* First a dmabuf_cmsg for # bytes copied to user
  2524				 * buffer.
  2525				 */
  2526				memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg));
  2527				dmabuf_cmsg.frag_size = copy;
  2528				err = put_cmsg_notrunc(msg, SOL_SOCKET,
  2529						       SO_DEVMEM_LINEAR,
  2530						       sizeof(dmabuf_cmsg),
  2531						       &dmabuf_cmsg);
  2532				if (err)
  2533					goto out;
  2534	
  2535				sent += copy;
  2536	
  2537				if (remaining_len == 0)
  2538					goto out;
  2539			}
  2540	
  2541			/* after that, send information of dmabuf pages through a
  2542			 * sequence of cmsg
  2543			 */
  2544			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
  2545				skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
  2546				struct net_iov *niov;
  2547				u64 frag_offset;
  2548				u32 token;
  2549				int end;
  2550	
  2551				/* !skb_frags_readable() should indicate that ALL the
  2552				 * frags in this skb are dmabuf net_iovs. We're checking
  2553				 * for that flag above, but also check individual frags
  2554				 * here. If the tcp stack is not setting
  2555				 * skb_frags_readable() correctly, we still don't want
  2556				 * to crash here.
  2557				 */
  2558				if (!skb_frag_net_iov(frag)) {
  2559					net_err_ratelimited("Found non-dmabuf skb with net_iov");
  2560					err = -ENODEV;
  2561					goto out;
  2562				}
  2563	
  2564				niov = skb_frag_net_iov(frag);
  2565				if (!net_is_devmem_iov(niov)) {
  2566					err = -ENODEV;
  2567					goto out;
  2568				}
  2569	
  2570				end = start + skb_frag_size(frag);
  2571				copy = end - offset;
  2572	
  2573				if (copy > 0) {
  2574					copy = min(copy, remaining_len);
  2575	
  2576					frag_offset = net_iov_virtual_addr(niov) +
  2577						      skb_frag_off(frag) + offset -
  2578						      start;
  2579					dmabuf_cmsg.frag_offset = frag_offset;
  2580					dmabuf_cmsg.frag_size = copy;
  2581	
  2582					binding = net_devmem_iov_binding(niov);
  2583	
  2584					if (!sk->sk_devmem_info.binding)
  2585						sk->sk_devmem_info.binding = binding;
  2586	
  2587					if (sk->sk_devmem_info.binding != binding) {
  2588						err = -EFAULT;
  2589						goto out;
  2590					}
  2591	
  2592					if (sk->sk_devmem_info.autorelease) {
  2593						err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
  2594									 skb_shinfo(skb)->nr_frags - i);
  2595						if (err)
  2596							goto out;
  2597	
  2598						dmabuf_cmsg.frag_token =
  2599							tcp_xa_pool.tokens[tcp_xa_pool.idx];
  2600					} else {
  2601						token = net_iov_virtual_addr(niov) >> PAGE_SHIFT;
  2602						dmabuf_cmsg.frag_token = token;
  2603					}
  2604	
  2605	
  2606					/* Will perform the exchange later */
  2607					dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
  2608	
  2609					offset += copy;
  2610					remaining_len -= copy;
  2611	
  2612					err = put_cmsg_notrunc(msg, SOL_SOCKET,
  2613							       SO_DEVMEM_DMABUF,
  2614							       sizeof(dmabuf_cmsg),
  2615							       &dmabuf_cmsg);
  2616					if (err)
  2617						goto out;
  2618	
  2619					if (sk->sk_devmem_info.autorelease) {
  2620						atomic_long_inc(&niov->pp_ref_count);
  2621						tcp_xa_pool.netmems[tcp_xa_pool.idx++] =
  2622							skb_frag_netmem(frag);
  2623					} else {
  2624						if (atomic_inc_return(&niov->uref) == 1)
  2625							atomic_long_inc(&niov->pp_ref_count);
  2626						refs++;
  2627					}
  2628	
  2629					sent += copy;
  2630	
  2631					if (remaining_len == 0)
  2632						goto out;
  2633				}
  2634				start = end;
  2635			}
  2636	
  2637			tcp_xa_pool_commit(sk, &tcp_xa_pool);
  2638	
  2639			if (!remaining_len)
  2640				goto out;
  2641	
  2642			/* if remaining_len is not satisfied yet, we need to go to the
  2643			 * next frag in the frag_list to satisfy remaining_len.
  2644			 */
  2645			skb = skb_shinfo(skb)->frag_list ?: skb->next;
  2646	
  2647			offset = offset - start;
  2648		} while (skb);
  2649	
  2650		if (remaining_len) {
  2651			err = -EFAULT;
  2652			goto out;
  2653		}
  2654	
  2655	out:
  2656		tcp_xa_pool_commit(sk, &tcp_xa_pool);
  2657	
  2658		if (!sent)
  2659			sent = err;
  2660	
> 2661		if (refs > 0)
  2662			atomic_add(refs, &sk->sk_devmem_info.outstanding_urefs);
  2663	
  2664		return sent;
  2665	}
  2666	
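The warning points at the manual-release accounting path: `refs` is only
ever incremented, never initialized. A minimal fix sketch (one option;
the author may restructure instead) zeroes the accumulator at
declaration:

```diff
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 	int sent = 0;
 	int err = 0;
-	int refs;
+	int refs = 0;
```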

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option
  2025-11-05 19:31       ` Stanislav Fomichev
@ 2025-11-05 23:17         ` Stanislav Fomichev
  2025-11-07  2:22           ` Bobby Eshleman
  0 siblings, 1 reply; 18+ messages in thread
From: Stanislav Fomichev @ 2025-11-05 23:17 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Bobby Eshleman, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, netdev, linux-kernel, linux-arch,
	linux-doc, linux-kselftest, Stanislav Fomichev, Bobby Eshleman

On 11/05, Stanislav Fomichev wrote:
> On 11/05, Mina Almasry wrote:
> > On Wed, Nov 5, 2025 at 9:34 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 11/04, Bobby Eshleman wrote:
> > > > From: Bobby Eshleman <bobbyeshleman@meta.com>
> > > >
> > >
> > > [..]
> > >
> > > > +Autorelease Control
> > > > +~~~~~~~~~~~~~~~~~~~
> > >
> > > Have you considered an option to have this flag on the dmabuf binding
> > > itself? This will let us keep everything in ynl and not add a new socket
> > > option. I think also semantically, this is a property of the binding
> > > and not the socket? (not sure what's gonna happen if we have
> > > autorelease=on and autorelease=off sockets receiving to the same
> > > dmabuf)
> > 
> > I think this thread (and maybe other comments on that patch) is the
> > context that missed your inbox:
> > 
> > https://lore.kernel.org/netdev/aQIoxVO3oICd8U8Q@devvm11784.nha0.facebook.com/
> > 
> > Let us know if you disagree.
> 
> Thanks, I did miss that whole v5 because I was OOO, let me take a look!

Thank you for the context!

I think the current approach is OK and we can go with it, but I wonder
whether we can simplify things a bit? What if we prohibit the
co-existence of autorelease=on and autorelease=off sockets on the
system? The first binding basically locks the kernel path into one mode
or the other (presumably via a static branch) and prohibits new bindings
that use a different mode. That would let us keep the mode on the
binding and spare us from reasoning about co-existence (we could also
still keep things like the one-dmabuf-per-socket restriction in the new
mode, etc.).

I think for you, Mina, this should still work? You would have a knob to
go back to the old mode if needed. At the same time, we keep the UAPI
surface smaller and the path simpler. Ideally, we could also deprecate
the old mode at some point (if the migration succeeds, of course). WDYT?


* Re: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
  2025-11-05  1:23 ` [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management Bobby Eshleman
  2025-11-05 16:02   ` kernel test robot
  2025-11-05 20:55   ` kernel test robot
@ 2025-11-06 15:11   ` Dan Carpenter
  2025-11-06 15:14   ` Dan Carpenter
  3 siblings, 0 replies; 18+ messages in thread
From: Dan Carpenter @ 2025-11-06 15:11 UTC (permalink / raw)
  To: oe-kbuild, Bobby Eshleman, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann,
	Jonathan Corbet, Andrew Lunn, Shuah Khan, Mina Almasry
  Cc: lkp, oe-kbuild-all, netdev, linux-kernel, linux-arch, linux-doc,
	linux-kselftest, Stanislav Fomichev, Bobby Eshleman

Hi Bobby,

kernel test robot noticed the following build warnings:

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-rename-tx_vec-to-vec-in-dmabuf-binding/20251105-092703
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-upstream-v6-3-ea98cf4d40b3%40meta.com
patch subject: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
config: openrisc-randconfig-r073-20251105 (https://download.01.org/0day-ci/archive/20251106/202511060119.MAzcsLoN-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 10.5.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202511060119.MAzcsLoN-lkp@intel.com/

New smatch warnings:
net/core/sock.c:1107 sock_devmem_dontneed_manual_release() error: uninitialized symbol 'ret'.

vim +/ret +1107 net/core/sock.c

45aa39492cf4dd Bobby Eshleman 2025-11-04  1086  static noinline_for_stack int
45aa39492cf4dd Bobby Eshleman 2025-11-04  1087  sock_devmem_dontneed_manual_release(struct sock *sk, struct dmabuf_token *tokens,
45aa39492cf4dd Bobby Eshleman 2025-11-04  1088  				    unsigned int num_tokens)
45aa39492cf4dd Bobby Eshleman 2025-11-04  1089  {
45aa39492cf4dd Bobby Eshleman 2025-11-04  1090  	struct net_iov *niov;
45aa39492cf4dd Bobby Eshleman 2025-11-04  1091  	unsigned int i, j;
45aa39492cf4dd Bobby Eshleman 2025-11-04  1092  	netmem_ref netmem;
45aa39492cf4dd Bobby Eshleman 2025-11-04  1093  	unsigned int token;
45aa39492cf4dd Bobby Eshleman 2025-11-04  1094  	int num_frags = 0;
45aa39492cf4dd Bobby Eshleman 2025-11-04  1095  	int ret;

ret needs to be = 0;

45aa39492cf4dd Bobby Eshleman 2025-11-04  1096  
45aa39492cf4dd Bobby Eshleman 2025-11-04  1097  	if (!sk->sk_devmem_info.binding)
45aa39492cf4dd Bobby Eshleman 2025-11-04  1098  		return -EINVAL;
45aa39492cf4dd Bobby Eshleman 2025-11-04  1099  
45aa39492cf4dd Bobby Eshleman 2025-11-04  1100  	for (i = 0; i < num_tokens; i++) {
45aa39492cf4dd Bobby Eshleman 2025-11-04  1101  		for (j = 0; j < tokens[i].token_count; j++) {
45aa39492cf4dd Bobby Eshleman 2025-11-04  1102  			token = tokens[i].token_start + j;
45aa39492cf4dd Bobby Eshleman 2025-11-04  1103  			if (token >= sk->sk_devmem_info.binding->dmabuf->size / PAGE_SIZE)
45aa39492cf4dd Bobby Eshleman 2025-11-04  1104  				break;
45aa39492cf4dd Bobby Eshleman 2025-11-04  1105  
45aa39492cf4dd Bobby Eshleman 2025-11-04  1106  			if (++num_frags > MAX_DONTNEED_FRAGS)
45aa39492cf4dd Bobby Eshleman 2025-11-04 @1107  				return ret;

Uninitialized.  It's always a good idea to test code with
CONFIG_INIT_STACK_ALL_PATTERN.

45aa39492cf4dd Bobby Eshleman 2025-11-04  1108  
45aa39492cf4dd Bobby Eshleman 2025-11-04  1109  			niov = sk->sk_devmem_info.binding->vec[token];
45aa39492cf4dd Bobby Eshleman 2025-11-04  1110  			if (atomic_dec_and_test(&niov->uref)) {
45aa39492cf4dd Bobby Eshleman 2025-11-04  1111  				netmem = net_iov_to_netmem(niov);
45aa39492cf4dd Bobby Eshleman 2025-11-04  1112  				WARN_ON_ONCE(!napi_pp_put_page(netmem));
45aa39492cf4dd Bobby Eshleman 2025-11-04  1113  			}
45aa39492cf4dd Bobby Eshleman 2025-11-04  1114  			ret++;

Uninitialized.

45aa39492cf4dd Bobby Eshleman 2025-11-04  1115  		}
45aa39492cf4dd Bobby Eshleman 2025-11-04  1116  	}
45aa39492cf4dd Bobby Eshleman 2025-11-04  1117  
45aa39492cf4dd Bobby Eshleman 2025-11-04  1118  	atomic_sub(ret, &sk->sk_devmem_info.outstanding_urefs);
45aa39492cf4dd Bobby Eshleman 2025-11-04  1119  
45aa39492cf4dd Bobby Eshleman 2025-11-04  1120  	return ret;
45aa39492cf4dd Bobby Eshleman 2025-11-04  1121  }

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
  2025-11-05  1:23 ` [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management Bobby Eshleman
                     ` (2 preceding siblings ...)
  2025-11-06 15:11   ` Dan Carpenter
@ 2025-11-06 15:14   ` Dan Carpenter
  3 siblings, 0 replies; 18+ messages in thread
From: Dan Carpenter @ 2025-11-06 15:14 UTC (permalink / raw)
  To: oe-kbuild, Bobby Eshleman, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, Neal Cardwell, David Ahern, Arnd Bergmann,
	Jonathan Corbet, Andrew Lunn, Shuah Khan, Mina Almasry
  Cc: lkp, oe-kbuild-all, netdev, linux-kernel, linux-arch, linux-doc,
	linux-kselftest, Stanislav Fomichev, Bobby Eshleman

Hi Bobby,

kernel test robot noticed the following build warnings:

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-rename-tx_vec-to-vec-in-dmabuf-binding/20251105-092703
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-upstream-v6-3-ea98cf4d40b3%40meta.com
patch subject: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
config: nios2-randconfig-r071-20251105 (https://download.01.org/0day-ci/archive/20251106/202511060352.2kPPX3xE-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 8.5.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202511060352.2kPPX3xE-lkp@intel.com/

New smatch warnings:
net/ipv4/tcp.c:2626 tcp_recvmsg_dmabuf() error: uninitialized symbol 'refs'.

vim +/refs +2626 net/ipv4/tcp.c

8f0b3cc9a4c102 Mina Almasry       2024-09-10  2485  static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2486  			      unsigned int offset, struct msghdr *msg,
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2487  			      int remaining_len)
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2488  {
45aa39492cf4dd Bobby Eshleman     2025-11-04  2489  	struct net_devmem_dmabuf_binding *binding = NULL;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2490  	struct dmabuf_cmsg dmabuf_cmsg = { 0 };
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2491  	struct tcp_xa_pool tcp_xa_pool;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2492  	unsigned int start;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2493  	int i, copy, n;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2494  	int sent = 0;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2495  	int err = 0;
45aa39492cf4dd Bobby Eshleman     2025-11-04  2496  	int refs;

refs needs to be initialized to zero.  Best practice is to use
CONFIG_INIT_STACK_ALL_PATTERN for testing.

8f0b3cc9a4c102 Mina Almasry       2024-09-10  2497  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2498  	tcp_xa_pool.max = 0;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2499  	tcp_xa_pool.idx = 0;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2500  	do {
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2501  		start = skb_headlen(skb);
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2502  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2503  		if (skb_frags_readable(skb)) {
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2504  			err = -ENODEV;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2505  			goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2506  		}
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2507  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2508  		/* Copy header. */
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2509  		copy = start - offset;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2510  		if (copy > 0) {
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2511  			copy = min(copy, remaining_len);
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2512  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2513  			n = copy_to_iter(skb->data + offset, copy,
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2514  					 &msg->msg_iter);
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2515  			if (n != copy) {
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2516  				err = -EFAULT;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2517  				goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2518  			}
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2519  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2520  			offset += copy;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2521  			remaining_len -= copy;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2522  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2523  			/* First a dmabuf_cmsg for # bytes copied to user
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2524  			 * buffer.
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2525  			 */
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2526  			memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg));
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2527  			dmabuf_cmsg.frag_size = copy;
18912c520674ec Stanislav Fomichev 2025-02-24  2528  			err = put_cmsg_notrunc(msg, SOL_SOCKET,
18912c520674ec Stanislav Fomichev 2025-02-24  2529  					       SO_DEVMEM_LINEAR,
18912c520674ec Stanislav Fomichev 2025-02-24  2530  					       sizeof(dmabuf_cmsg),
18912c520674ec Stanislav Fomichev 2025-02-24  2531  					       &dmabuf_cmsg);
18912c520674ec Stanislav Fomichev 2025-02-24  2532  			if (err)
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2533  				goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2534  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2535  			sent += copy;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2536  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2537  			if (remaining_len == 0)
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2538  				goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2539  		}
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2540  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2541  		/* after that, send information of dmabuf pages through a
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2542  		 * sequence of cmsg
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2543  		 */
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2544  		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2545  			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2546  			struct net_iov *niov;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2547  			u64 frag_offset;
45aa39492cf4dd Bobby Eshleman     2025-11-04  2548  			u32 token;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2549  			int end;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2550  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2551  			/* !skb_frags_readable() should indicate that ALL the
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2552  			 * frags in this skb are dmabuf net_iovs. We're checking
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2553  			 * for that flag above, but also check individual frags
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2554  			 * here. If the tcp stack is not setting
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2555  			 * skb_frags_readable() correctly, we still don't want
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2556  			 * to crash here.
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2557  			 */
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2558  			if (!skb_frag_net_iov(frag)) {
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2559  				net_err_ratelimited("Found non-dmabuf skb with net_iov");
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2560  				err = -ENODEV;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2561  				goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2562  			}
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2563  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2564  			niov = skb_frag_net_iov(frag);
69e39537b66232 Pavel Begunkov     2025-02-04  2565  			if (!net_is_devmem_iov(niov)) {
69e39537b66232 Pavel Begunkov     2025-02-04  2566  				err = -ENODEV;
69e39537b66232 Pavel Begunkov     2025-02-04  2567  				goto out;
69e39537b66232 Pavel Begunkov     2025-02-04  2568  			}
69e39537b66232 Pavel Begunkov     2025-02-04  2569  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2570  			end = start + skb_frag_size(frag);
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2571  			copy = end - offset;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2572  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2573  			if (copy > 0) {
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2574  				copy = min(copy, remaining_len);
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2575  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2576  				frag_offset = net_iov_virtual_addr(niov) +
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2577  					      skb_frag_off(frag) + offset -
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2578  					      start;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2579  				dmabuf_cmsg.frag_offset = frag_offset;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2580  				dmabuf_cmsg.frag_size = copy;
45aa39492cf4dd Bobby Eshleman     2025-11-04  2581  
45aa39492cf4dd Bobby Eshleman     2025-11-04  2582  				binding = net_devmem_iov_binding(niov);
45aa39492cf4dd Bobby Eshleman     2025-11-04  2583  
45aa39492cf4dd Bobby Eshleman     2025-11-04  2584  				if (!sk->sk_devmem_info.binding)
45aa39492cf4dd Bobby Eshleman     2025-11-04  2585  					sk->sk_devmem_info.binding = binding;
45aa39492cf4dd Bobby Eshleman     2025-11-04  2586  
45aa39492cf4dd Bobby Eshleman     2025-11-04  2587  				if (sk->sk_devmem_info.binding != binding) {
45aa39492cf4dd Bobby Eshleman     2025-11-04  2588  					err = -EFAULT;
45aa39492cf4dd Bobby Eshleman     2025-11-04  2589  					goto out;
45aa39492cf4dd Bobby Eshleman     2025-11-04  2590  				}
45aa39492cf4dd Bobby Eshleman     2025-11-04  2591  
45aa39492cf4dd Bobby Eshleman     2025-11-04  2592  				if (sk->sk_devmem_info.autorelease) {
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2593  					err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2594  								 skb_shinfo(skb)->nr_frags - i);
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2595  					if (err)
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2596  						goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2597  
45aa39492cf4dd Bobby Eshleman     2025-11-04  2598  					dmabuf_cmsg.frag_token =
45aa39492cf4dd Bobby Eshleman     2025-11-04  2599  						tcp_xa_pool.tokens[tcp_xa_pool.idx];
45aa39492cf4dd Bobby Eshleman     2025-11-04  2600  				} else {
45aa39492cf4dd Bobby Eshleman     2025-11-04  2601  					token = net_iov_virtual_addr(niov) >> PAGE_SHIFT;
45aa39492cf4dd Bobby Eshleman     2025-11-04  2602  					dmabuf_cmsg.frag_token = token;
45aa39492cf4dd Bobby Eshleman     2025-11-04  2603  				}
45aa39492cf4dd Bobby Eshleman     2025-11-04  2604  
45aa39492cf4dd Bobby Eshleman     2025-11-04  2605  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2606  				/* Will perform the exchange later */
297d389e9e5bd4 Pavel Begunkov     2025-02-04  2607  				dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2608  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2609  				offset += copy;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2610  				remaining_len -= copy;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2611  
18912c520674ec Stanislav Fomichev 2025-02-24  2612  				err = put_cmsg_notrunc(msg, SOL_SOCKET,
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2613  						       SO_DEVMEM_DMABUF,
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2614  						       sizeof(dmabuf_cmsg),
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2615  						       &dmabuf_cmsg);
18912c520674ec Stanislav Fomichev 2025-02-24  2616  				if (err)
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2617  					goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2618  
45aa39492cf4dd Bobby Eshleman     2025-11-04  2619  				if (sk->sk_devmem_info.autorelease) {
45aa39492cf4dd Bobby Eshleman     2025-11-04  2620  					atomic_long_inc(&niov->pp_ref_count);
45aa39492cf4dd Bobby Eshleman     2025-11-04  2621  					tcp_xa_pool.netmems[tcp_xa_pool.idx++] =
45aa39492cf4dd Bobby Eshleman     2025-11-04  2622  						skb_frag_netmem(frag);
45aa39492cf4dd Bobby Eshleman     2025-11-04  2623  				} else {
45aa39492cf4dd Bobby Eshleman     2025-11-04  2624  					if (atomic_inc_return(&niov->uref) == 1)
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2625  						atomic_long_inc(&niov->pp_ref_count);
45aa39492cf4dd Bobby Eshleman     2025-11-04 @2626  					refs++;
                                                                                        ^^^^^^

45aa39492cf4dd Bobby Eshleman     2025-11-04  2627  				}
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2628  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2629  				sent += copy;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2630  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2631  				if (remaining_len == 0)
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2632  					goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2633  			}
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2634  			start = end;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2635  		}
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2636  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2637  		tcp_xa_pool_commit(sk, &tcp_xa_pool);
45aa39492cf4dd Bobby Eshleman     2025-11-04  2638  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2639  		if (!remaining_len)
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2640  			goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2641  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2642  		/* if remaining_len is not satisfied yet, we need to go to the
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2643  		 * next frag in the frag_list to satisfy remaining_len.
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2644  		 */
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2645  		skb = skb_shinfo(skb)->frag_list ?: skb->next;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2646  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2647  		offset = offset - start;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2648  	} while (skb);
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2649  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2650  	if (remaining_len) {
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2651  		err = -EFAULT;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2652  		goto out;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2653  	}
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2654  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2655  out:
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2656  	tcp_xa_pool_commit(sk, &tcp_xa_pool);
45aa39492cf4dd Bobby Eshleman     2025-11-04  2657  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2658  	if (!sent)
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2659  		sent = err;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2660  
45aa39492cf4dd Bobby Eshleman     2025-11-04  2661  	if (refs > 0)
45aa39492cf4dd Bobby Eshleman     2025-11-04  2662  		atomic_add(refs, &sk->sk_devmem_info.outstanding_urefs);
45aa39492cf4dd Bobby Eshleman     2025-11-04  2663  
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2664  	return sent;
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2665  }
8f0b3cc9a4c102 Mina Almasry       2024-09-10  2666  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option
  2025-11-05 23:17         ` Stanislav Fomichev
@ 2025-11-07  2:22           ` Bobby Eshleman
  0 siblings, 0 replies; 18+ messages in thread
From: Bobby Eshleman @ 2025-11-07  2:22 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Mina Almasry, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	Neal Cardwell, David Ahern, Arnd Bergmann, Jonathan Corbet,
	Andrew Lunn, Shuah Khan, netdev, linux-kernel, linux-arch,
	linux-doc, linux-kselftest, Stanislav Fomichev, Bobby Eshleman

On Wed, Nov 05, 2025 at 03:17:06PM -0800, Stanislav Fomichev wrote:
> On 11/05, Stanislav Fomichev wrote:
> 
> Thank you for the context!
> 
> I think that the current approach is ok, we can go with that, but I
> wonder whether we can simplify things a bit? What if we prohibit the
> co-existence of autorelease=on and autorelease=off sockets on the
> system? The first binding basically locks the kernel path into one way or
> the other (presumably by using static-branch) and prohibits new bindings
> that use a different mode. It will let us still keep the mode on the binding
> and will help us not think about the co-existence (we can also still keep
> things like one-dmabuf-per-socket restrictions in the new mode, etc).
> 

That approach is okay by me.

Best,
Bobby

> I think for you, Mina, this should still work? You have a knob to go back
> to the old mode if needed. At the same time, we keep the UAPI surface
> smaller and keep the path more simple. Ideally, we can also deprecate
> the old mode at some point (if you manage to successfully migrate, of
> course). WDYT?


end of thread, other threads:[~2025-11-07  2:22 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-05  1:23 [PATCH net-next v6 0/6] net: devmem: improve cpu cost of RX token management Bobby Eshleman
2025-11-05  1:23 ` [PATCH net-next v6 1/6] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
2025-11-05  1:23 ` [PATCH net-next v6 2/6] net: devmem: refactor sock_devmem_dontneed for autorelease split Bobby Eshleman
2025-11-05  1:23 ` [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management Bobby Eshleman
2025-11-05 16:02   ` kernel test robot
2025-11-05 20:55   ` kernel test robot
2025-11-06 15:11   ` Dan Carpenter
2025-11-06 15:14   ` Dan Carpenter
2025-11-05  1:23 ` [PATCH net-next v6 4/6] net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control Bobby Eshleman
2025-11-05 17:16   ` kernel test robot
2025-11-05  1:23 ` [PATCH net-next v6 5/6] net: devmem: document SO_DEVMEM_AUTORELEASE socket option Bobby Eshleman
2025-11-05 17:34   ` Stanislav Fomichev
2025-11-05 17:44     ` Mina Almasry
2025-11-05 19:31       ` Stanislav Fomichev
2025-11-05 23:17         ` Stanislav Fomichev
2025-11-07  2:22           ` Bobby Eshleman
2025-11-05 17:59     ` Bobby Eshleman
2025-11-05  1:23 ` [PATCH net-next v6 6/6] net: devmem: add tests for " Bobby Eshleman
