* [PATCH net-next v4 1/2] net: devmem: rename tx_vec to vec in dmabuf binding
2025-09-26 16:31 [PATCH net-next v4 0/2] net: devmem: improve cpu cost of RX token management Bobby Eshleman
@ 2025-09-26 16:31 ` Bobby Eshleman
2025-09-26 16:31 ` [PATCH net-next v4 2/2] net: devmem: use niov array for token management Bobby Eshleman
2025-09-27 6:00 ` [syzbot ci] Re: net: devmem: improve cpu cost of RX " syzbot ci
2 siblings, 0 replies; 7+ messages in thread
From: Bobby Eshleman @ 2025-09-26 16:31 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern
Cc: netdev, linux-kernel, Stanislav Fomichev, Mina Almasry,
Bobby Eshleman
From: Bobby Eshleman <bobbyeshleman@meta.com>
Rename the 'tx_vec' field in struct net_devmem_dmabuf_binding to 'vec'.
This field holds pointers to net_iov structures. The rename prepares for
reusing 'vec' for both TX and RX directions.
No functional change intended.
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
net/core/devmem.c | 22 +++++++++++-----------
net/core/devmem.h | 2 +-
2 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index d9de31a6cc7f..b4c570d4f37a 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -74,7 +74,7 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
dma_buf_detach(binding->dmabuf, binding->attachment);
dma_buf_put(binding->dmabuf);
xa_destroy(&binding->bound_rxqs);
- kvfree(binding->tx_vec);
+ kvfree(binding->vec);
kfree(binding);
}
@@ -231,10 +231,10 @@ net_devmem_bind_dmabuf(struct net_device *dev,
}
if (direction == DMA_TO_DEVICE) {
- binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
- sizeof(struct net_iov *),
- GFP_KERNEL);
- if (!binding->tx_vec) {
+ binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+ sizeof(struct net_iov *),
+ GFP_KERNEL);
+ if (!binding->vec) {
err = -ENOMEM;
goto err_unmap;
}
@@ -248,7 +248,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
dev_to_node(&dev->dev));
if (!binding->chunk_pool) {
err = -ENOMEM;
- goto err_tx_vec;
+ goto err_vec;
}
virtual = 0;
@@ -294,7 +294,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
net_devmem_get_dma_addr(niov));
if (direction == DMA_TO_DEVICE)
- binding->tx_vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+ binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
}
virtual += len;
@@ -314,8 +314,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
gen_pool_for_each_chunk(binding->chunk_pool,
net_devmem_dmabuf_free_chunk_owner, NULL);
gen_pool_destroy(binding->chunk_pool);
-err_tx_vec:
- kvfree(binding->tx_vec);
+err_vec:
+ kvfree(binding->vec);
err_unmap:
dma_buf_unmap_attachment_unlocked(binding->attachment, binding->sgt,
direction);
@@ -361,7 +361,7 @@ struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk,
int err = 0;
binding = net_devmem_lookup_dmabuf(dmabuf_id);
- if (!binding || !binding->tx_vec) {
+ if (!binding || !binding->vec) {
err = -EINVAL;
goto out_err;
}
@@ -393,7 +393,7 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding,
*off = virt_addr % PAGE_SIZE;
*size = PAGE_SIZE - *off;
- return binding->tx_vec[virt_addr / PAGE_SIZE];
+ return binding->vec[virt_addr / PAGE_SIZE];
}
/*** "Dmabuf devmem memory provider" ***/
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 101150d761af..2ada54fb63d7 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -63,7 +63,7 @@ struct net_devmem_dmabuf_binding {
* address. This array is convenient to map the virtual addresses to
* net_iovs in the TX path.
*/
- struct net_iov **tx_vec;
+ struct net_iov **vec;
struct work_struct unbind_w;
};
--
2.47.3
* [PATCH net-next v4 2/2] net: devmem: use niov array for token management
2025-09-26 16:31 [PATCH net-next v4 0/2] net: devmem: improve cpu cost of RX token management Bobby Eshleman
2025-09-26 16:31 ` [PATCH net-next v4 1/2] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
@ 2025-09-26 16:31 ` Bobby Eshleman
2025-09-26 23:22 ` Jakub Kicinski
2025-09-27 9:08 ` kernel test robot
2025-09-27 6:00 ` [syzbot ci] Re: net: devmem: improve cpu cost of RX " syzbot ci
2 siblings, 2 replies; 7+ messages in thread
From: Bobby Eshleman @ 2025-09-26 16:31 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell,
David Ahern
Cc: netdev, linux-kernel, Stanislav Fomichev, Mina Almasry,
Bobby Eshleman
From: Bobby Eshleman <bobbyeshleman@meta.com>
Improve the CPU performance of devmem token management by using page
offsets as dmabuf tokens, so that token lookups become direct array
accesses instead of xarray lookups. Consequently, the xarray can be
removed. The result is an average 5% reduction in CPU cycles spent by
devmem RX user threads.
This patch changes the meaning of tokens. Tokens previously referred to
unique fragments of pages; with this patch they refer to pages, not
fragments. Because of this, multiple tokens may refer to the same page
and therefore carry identical values (e.g., two small fragments may
coexist on the same page). The token and offset pair that the user
receives still uniquely identifies a fragment when needed. This assumes
that the user is not attempting to sort / uniq the token list using
tokens alone.
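As a minimal sketch of the scheme (for illustration only, using the
helpers that appear in the diff below):

	/* RX path: the token handed to userspace is simply the niov's
	 * page offset within the dmabuf (see tcp_recvmsg_dmabuf()).
	 */
	u32 token = net_iov_virtual_addr(niov) >> PAGE_SHIFT;

	/* Token return path: the token indexes directly into the
	 * binding's niov array, replacing the old xarray lookup (see
	 * sock_devmem_dontneed()).
	 */
	struct net_iov *returned = binding->vec[token];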
A new restriction is added to the implementation: devmem RX sockets
cannot switch dmabuf bindings. In practice this is usually a symptom of
an invalid configuration, as the flow would have to be steered to a
different queue or device with a different binding, which is generally
bad for TCP flows. The restriction is necessary because the 32-bit
dmabuf token does not have enough bits to represent both the pages in a
large dmabuf and a binding or dmabuf ID. For example, a system with
8 NICs and 32 queues per NIC requires 8 bits for a binding / queue ID
(8 NICs * 32 queues == 256 queues total == 2^8), which leaves only 24
bits for dmabuf pages (2^24 * 4096 bytes == 64GB). This is insufficient
for the device and queue counts on many current systems, or for systems
that need larger GPU dmabufs (for reference, my current H100 has 80GB
of GPU memory per device).
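For illustration, the bit budget above works out as follows (the macro
names are hypothetical and not part of the patch):

	#define TOKEN_BITS	32
	#define BINDING_BITS	8	/* 8 NICs * 32 queues == 256 IDs == 2^8 */
	#define PAGE_IDX_BITS	(TOKEN_BITS - BINDING_BITS)	/* 24 bits */
	/* 2^24 pages * 4096 bytes/page == 64GB of addressable dmabuf,
	 * already smaller than a single 80GB H100 dmabuf.
	 */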
Using kperf[1] with 4 flows and workers, this patch improves receive
worker CPU utilization by ~4.9 percentage points, with slightly better
throughput.
Before, mean cpu util for rx workers ~83.6%:
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: 4 2.30 0.00 79.43 0.00 0.65 0.21 0.00 0.00 0.00 17.41
Average: 5 2.27 0.00 80.40 0.00 0.45 0.21 0.00 0.00 0.00 16.67
Average: 6 2.28 0.00 80.47 0.00 0.46 0.25 0.00 0.00 0.00 16.54
Average: 7 2.42 0.00 82.05 0.00 0.46 0.21 0.00 0.00 0.00 14.86
After, mean cpu util for rx workers ~78.7%:
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: 4 2.61 0.00 73.31 0.00 0.76 0.11 0.00 0.00 0.00 23.20
Average: 5 2.95 0.00 74.24 0.00 0.66 0.22 0.00 0.00 0.00 21.94
Average: 6 2.81 0.00 73.38 0.00 0.97 0.11 0.00 0.00 0.00 22.73
Average: 7 3.05 0.00 78.76 0.00 0.76 0.11 0.00 0.00 0.00 17.32
Mean throughput improves slightly, but the difference falls within one
standard deviation (~45GB/s for 4 flows on a 50GB/s NIC, one hop).
This patch adds an atomic counter to net_iov to track the number of
outstanding user references (uref); the niovs are reached through
binding->vec. To avoid extra atomic overhead, pp_ref_count is only
incremented / decremented when uref transitions from zero to one or
from one to zero. If a user fails to return all tokens before closing
the socket, the binding finishes releasing the urefs when it is
unbound. pp_ref_count cannot be used directly for this because, at
cleanup time, the binding does not know how many of the pp_ref_count
references are held on behalf of socket users.
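A minimal sketch of the uref transitions (the wrapper names here are
hypothetical; the bodies mirror the hunks in tcp_recvmsg_dmabuf() and
sock_devmem_dontneed() below):

	static void uref_get(struct net_iov *niov)
	{
		/* Only the 0 -> 1 uref transition takes a pp reference. */
		if (atomic_inc_return(&niov->uref) == 1)
			atomic_long_inc(&niov->pp_ref_count);
	}

	static void uref_put(struct net_iov *niov)
	{
		/* Only the 1 -> 0 uref transition drops the pp reference. */
		if (atomic_dec_and_test(&niov->uref))
			WARN_ON_ONCE(!napi_pp_put_page(net_iov_to_netmem(niov)));
	}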
[1]: https://github.com/facebookexperimental/kperf
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
footprint
- fallback to cleaning up references in dmabuf unbind if socket leaked
tokens
- drop ethtool patch
Changes in v2:
- always use GFP_ZERO for binding->vec (Mina)
- remove WARN for changed binding (Mina)
- remove extraneous binding ref get (Mina)
- remove WARNs on invalid user input (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- fix length of alloc for urefs
---
include/net/netmem.h | 1 +
include/net/sock.h | 4 +--
net/core/devmem.c | 34 ++++++++++++------
net/core/devmem.h | 2 +-
net/core/sock.c | 34 ++++++++++++------
net/ipv4/tcp.c | 94 +++++++++++-------------------------------------
net/ipv4/tcp_ipv4.c | 18 ++--------
net/ipv4/tcp_minisocks.c | 2 +-
8 files changed, 75 insertions(+), 114 deletions(-)
diff --git a/include/net/netmem.h b/include/net/netmem.h
index f7dacc9e75fd..be6bc69c2f5a 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -116,6 +116,7 @@ struct net_iov {
};
struct net_iov_area *owner;
enum net_iov_type type;
+ atomic_t uref;
};
struct net_iov_area {
diff --git a/include/net/sock.h b/include/net/sock.h
index 8c5b64f41ab7..5dfeac963e66 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -347,7 +347,7 @@ struct sk_filter;
* @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
* @sk_scm_unused: unused flags for scm_recv()
* @ns_tracker: tracker for netns reference
- * @sk_user_frags: xarray of pages the user is holding a reference on.
+ * @sk_devmem_binding: the devmem binding used by the socket
* @sk_owner: reference to the real owner of the socket that calls
* sock_lock_init_class_and_name().
*/
@@ -574,7 +574,7 @@ struct sock {
struct numa_drop_counters *sk_drop_counters;
struct rcu_head sk_rcu;
netns_tracker ns_tracker;
- struct xarray sk_user_frags;
+ struct net_devmem_dmabuf_binding *sk_devmem_binding;
#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
struct module *sk_owner;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index b4c570d4f37a..865d8dee539f 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -11,6 +11,7 @@
#include <linux/genalloc.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
+#include <linux/skbuff_ref.h>
#include <linux/types.h>
#include <net/netdev_queues.h>
#include <net/netdev_rx_queue.h>
@@ -120,6 +121,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
struct netdev_rx_queue *rxq;
unsigned long xa_idx;
unsigned int rxq_idx;
+ int i;
xa_erase(&net_devmem_dmabuf_bindings, binding->id);
@@ -142,6 +144,20 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
__net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
}
+ for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
+ struct net_iov *niov;
+ netmem_ref netmem;
+
+ niov = binding->vec[i];
+
+ if (!net_is_devmem_iov(niov))
+ continue;
+
+ netmem = net_iov_to_netmem(niov);
+ while (atomic_dec_and_test(&niov->uref))
+ WARN_ON_ONCE(!napi_pp_put_page(netmem));
+ }
+
net_devmem_dmabuf_binding_put(binding);
}
@@ -230,14 +246,12 @@ net_devmem_bind_dmabuf(struct net_device *dev,
goto err_detach;
}
- if (direction == DMA_TO_DEVICE) {
- binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
- sizeof(struct net_iov *),
- GFP_KERNEL);
- if (!binding->vec) {
- err = -ENOMEM;
- goto err_unmap;
- }
+ binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+ sizeof(struct net_iov *),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!binding->vec) {
+ err = -ENOMEM;
+ goto err_unmap;
}
/* For simplicity we expect to make PAGE_SIZE allocations, but the
@@ -291,10 +305,10 @@ net_devmem_bind_dmabuf(struct net_device *dev,
niov = &owner->area.niovs[i];
niov->type = NET_IOV_DMABUF;
niov->owner = &owner->area;
+ atomic_set(&niov->uref, 0);
page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
net_devmem_get_dma_addr(niov));
- if (direction == DMA_TO_DEVICE)
- binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+ binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
}
virtual += len;
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 2ada54fb63d7..d4eb28d079bb 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -61,7 +61,7 @@ struct net_devmem_dmabuf_binding {
/* Array of net_iov pointers for this binding, sorted by virtual
* address. This array is convenient to map the virtual addresses to
- * net_iovs in the TX path.
+ * net_iovs.
*/
struct net_iov **vec;
diff --git a/net/core/sock.c b/net/core/sock.c
index dc03d4b5909a..4ee10b4d1254 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -87,6 +87,7 @@
#include <linux/unaligned.h>
#include <linux/capability.h>
+#include <linux/dma-buf.h>
#include <linux/errno.h>
#include <linux/errqueue.h>
#include <linux/types.h>
@@ -151,6 +152,7 @@
#include <uapi/linux/pidfd.h>
#include "dev.h"
+#include "devmem.h"
static DEFINE_MUTEX(proto_list_mutex);
static LIST_HEAD(proto_list);
@@ -1082,6 +1084,7 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
struct dmabuf_token *tokens;
int ret = 0, num_frags = 0;
netmem_ref netmems[16];
+ struct net_iov *niov;
if (!sk_is_tcp(sk))
return -EBADF;
@@ -1100,34 +1103,43 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
return -EFAULT;
}
- xa_lock_bh(&sk->sk_user_frags);
for (i = 0; i < num_tokens; i++) {
for (j = 0; j < tokens[i].token_count; j++) {
+ struct net_iov *niov;
+ unsigned int token;
+ netmem_ref netmem;
+
+ token = tokens[i].token_start + j;
+ if (token >= sk->sk_devmem_binding->dmabuf->size / PAGE_SIZE)
+ break;
+
if (++num_frags > MAX_DONTNEED_FRAGS)
goto frag_limit_reached;
-
- netmem_ref netmem = (__force netmem_ref)__xa_erase(
- &sk->sk_user_frags, tokens[i].token_start + j);
+ niov = sk->sk_devmem_binding->vec[token];
+ netmem = net_iov_to_netmem(niov);
if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
continue;
netmems[netmem_num++] = netmem;
if (netmem_num == ARRAY_SIZE(netmems)) {
- xa_unlock_bh(&sk->sk_user_frags);
- for (k = 0; k < netmem_num; k++)
- WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
+ for (k = 0; k < netmem_num; k++) {
+ niov = netmem_to_net_iov(netmems[k]);
+ if (atomic_dec_and_test(&niov->uref))
+ WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
+ }
netmem_num = 0;
- xa_lock_bh(&sk->sk_user_frags);
}
ret++;
}
}
frag_limit_reached:
- xa_unlock_bh(&sk->sk_user_frags);
- for (k = 0; k < netmem_num; k++)
- WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
+ for (k = 0; k < netmem_num; k++) {
+ niov = netmem_to_net_iov(netmems[k]);
+ if (atomic_dec_and_test(&niov->uref))
+ WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
+ }
kvfree(tokens);
return ret;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7949d16506a4..700e5c32ed84 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -261,6 +261,7 @@
#include <linux/memblock.h>
#include <linux/highmem.h>
#include <linux/cache.h>
+#include <linux/dma-buf.h>
#include <linux/err.h>
#include <linux/time.h>
#include <linux/slab.h>
@@ -494,7 +495,7 @@ void tcp_init_sock(struct sock *sk)
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
sk_sockets_allocated_inc(sk);
- xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1);
+ sk->sk_devmem_binding = NULL;
}
EXPORT_IPV6_MOD(tcp_init_sock);
@@ -2406,68 +2407,6 @@ static int tcp_inq_hint(struct sock *sk)
return inq;
}
-/* batch __xa_alloc() calls and reduce xa_lock()/xa_unlock() overhead. */
-struct tcp_xa_pool {
- u8 max; /* max <= MAX_SKB_FRAGS */
- u8 idx; /* idx <= max */
- __u32 tokens[MAX_SKB_FRAGS];
- netmem_ref netmems[MAX_SKB_FRAGS];
-};
-
-static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
-{
- int i;
-
- /* Commit part that has been copied to user space. */
- for (i = 0; i < p->idx; i++)
- __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
- (__force void *)p->netmems[i], GFP_KERNEL);
- /* Rollback what has been pre-allocated and is no longer needed. */
- for (; i < p->max; i++)
- __xa_erase(&sk->sk_user_frags, p->tokens[i]);
-
- p->max = 0;
- p->idx = 0;
-}
-
-static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p)
-{
- if (!p->max)
- return;
-
- xa_lock_bh(&sk->sk_user_frags);
-
- tcp_xa_pool_commit_locked(sk, p);
-
- xa_unlock_bh(&sk->sk_user_frags);
-}
-
-static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
- unsigned int max_frags)
-{
- int err, k;
-
- if (p->idx < p->max)
- return 0;
-
- xa_lock_bh(&sk->sk_user_frags);
-
- tcp_xa_pool_commit_locked(sk, p);
-
- for (k = 0; k < max_frags; k++) {
- err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
- XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
- if (err)
- break;
- }
-
- xa_unlock_bh(&sk->sk_user_frags);
-
- p->max = k;
- p->idx = 0;
- return k ? 0 : err;
-}
-
/* On error, returns the -errno. On success, returns number of bytes sent to the
* user. May not consume all of @remaining_len.
*/
@@ -2476,14 +2415,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
int remaining_len)
{
struct dmabuf_cmsg dmabuf_cmsg = { 0 };
- struct tcp_xa_pool tcp_xa_pool;
unsigned int start;
int i, copy, n;
int sent = 0;
int err = 0;
- tcp_xa_pool.max = 0;
- tcp_xa_pool.idx = 0;
do {
start = skb_headlen(skb);
@@ -2530,8 +2466,12 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
*/
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ struct net_devmem_dmabuf_binding *binding;
struct net_iov *niov;
u64 frag_offset;
+ size_t size;
+ size_t len;
+ u32 token;
int end;
/* !skb_frags_readable() should indicate that ALL the
@@ -2564,13 +2504,21 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
start;
dmabuf_cmsg.frag_offset = frag_offset;
dmabuf_cmsg.frag_size = copy;
- err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
- skb_shinfo(skb)->nr_frags - i);
- if (err)
+
+ binding = net_devmem_iov_binding(niov);
+
+ if (!sk->sk_devmem_binding)
+ sk->sk_devmem_binding = binding;
+
+ if (sk->sk_devmem_binding != binding) {
+ err = -EFAULT;
goto out;
+ }
+
+ token = net_iov_virtual_addr(niov) >> PAGE_SHIFT;
+ dmabuf_cmsg.frag_token = token;
/* Will perform the exchange later */
- dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
offset += copy;
@@ -2583,8 +2531,8 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
if (err)
goto out;
- atomic_long_inc(&niov->pp_ref_count);
- tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
+ if (atomic_inc_return(&niov->uref) == 1)
+ atomic_long_inc(&niov->pp_ref_count);
sent += copy;
@@ -2594,7 +2542,6 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
start = end;
}
- tcp_xa_pool_commit(sk, &tcp_xa_pool);
if (!remaining_len)
goto out;
@@ -2612,7 +2559,6 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
}
out:
- tcp_xa_pool_commit(sk, &tcp_xa_pool);
if (!sent)
sent = err;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index b1fcf3e4e1ce..a73424b88531 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -89,6 +89,9 @@
#include <crypto/hash.h>
#include <linux/scatterlist.h>
+#include <linux/dma-buf.h>
+#include "../core/devmem.h"
+
#include <trace/events/tcp.h>
#ifdef CONFIG_TCP_MD5SIG
@@ -2536,25 +2539,10 @@ static int tcp_v4_init_sock(struct sock *sk)
return 0;
}
-static void tcp_release_user_frags(struct sock *sk)
-{
-#ifdef CONFIG_PAGE_POOL
- unsigned long index;
- void *netmem;
-
- xa_for_each(&sk->sk_user_frags, index, netmem)
- WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
-#endif
-}
-
void tcp_v4_destroy_sock(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
- tcp_release_user_frags(sk);
-
- xa_destroy(&sk->sk_user_frags);
-
trace_tcp_destroy_sock(sk);
tcp_clear_xmit_timers(sk);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 2ec8c6f1cdcc..e006a3021db9 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -665,7 +665,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
- xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
+ newsk->sk_devmem_binding = NULL;
return newsk;
}
--
2.47.3
* [syzbot ci] Re: net: devmem: improve cpu cost of RX token management
2025-09-26 16:31 [PATCH net-next v4 0/2] net: devmem: improve cpu cost of RX token management Bobby Eshleman
2025-09-26 16:31 ` [PATCH net-next v4 1/2] net: devmem: rename tx_vec to vec in dmabuf binding Bobby Eshleman
2025-09-26 16:31 ` [PATCH net-next v4 2/2] net: devmem: use niov array for token management Bobby Eshleman
@ 2025-09-27 6:00 ` syzbot ci
2 siblings, 0 replies; 7+ messages in thread
From: syzbot ci @ 2025-09-27 6:00 UTC (permalink / raw)
To: almasrymina, bobbyeshleman, bobbyeshleman, davem, dsahern,
edumazet, horms, kuba, kuniyu, linux-kernel, ncardwell, netdev,
pabeni, sdf, willemb
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v4] net: devmem: improve cpu cost of RX token management
https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v4-0-39156563c3ea@meta.com
* [PATCH net-next v4 1/2] net: devmem: rename tx_vec to vec in dmabuf binding
* [PATCH net-next v4 2/2] net: devmem: use niov array for token management
and found the following issue:
general protection fault in sock_devmem_dontneed
Full report is available here:
https://ci.syzbot.org/series/b8209bd4-e9f0-4c54-bad3-613e8431151b
***
general protection fault in sock_devmem_dontneed
tree: net-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base: dc1dea796b197aba2c3cae25bfef45f4b3ad46fe
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/b4d90fd9-9fbe-4e17-8fc0-3d6603df09da/config
C repro: https://ci.syzbot.org/findings/ce81b3c3-3db8-4643-9731-cbe331c65fdb/c_repro
syz repro: https://ci.syzbot.org/findings/ce81b3c3-3db8-4643-9731-cbe331c65fdb/syz_repro
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 UID: 0 PID: 5996 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:sock_devmem_dontneed+0x372/0x920 net/core/sock.c:1113
Code: 48 44 8b 20 44 89 74 24 54 45 01 f4 48 8b 44 24 60 42 80 3c 28 00 74 08 48 89 df e8 e8 5a c9 f8 4c 8b 33 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 74 08 4c 89 f7 e8 cf 5a c9 f8 4d 8b 3e 4c 89 f8 48
RSP: 0018:ffffc90002a1fac0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88810a8ab710 RCX: 1ffff11023002f45
RDX: ffff88801b339cc0 RSI: 0000000000002000 RDI: 0000000000000000
RBP: ffffc90002a1fc50 R08: ffffc90002a1fbdf R09: 0000000000000000
R10: ffffc90002a1fb60 R11: fffff52000543f7c R12: 0000000000f07000
R13: dffffc0000000000 R14: 0000000000000000 R15: ffff88810a8ab710
FS: 000055556f85a500(0000) GS:ffff8881a3c3d000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002000000a2000 CR3: 0000000024516000 CR4: 00000000000006f0
Call Trace:
<TASK>
sk_setsockopt+0x682/0x2dc0 net/core/sock.c:1304
do_sock_setsockopt+0x11b/0x1b0 net/socket.c:2340
__sys_setsockopt net/socket.c:2369 [inline]
__do_sys_setsockopt net/socket.c:2375 [inline]
__se_sys_setsockopt net/socket.c:2372 [inline]
__x64_sys_setsockopt+0x13f/0x1b0 net/socket.c:2372
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fea0438ec29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd04a8f368 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007fea045d5fa0 RCX: 00007fea0438ec29
RDX: 0000000000000050 RSI: 0000000000000001 RDI: 0000000000000003
RBP: 00007fea04411e41 R08: 0000000000000010 R09: 0000000000000000
R10: 00002000000a2000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fea045d5fa0 R14: 00007fea045d5fa0 R15: 0000000000000005
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:sock_devmem_dontneed+0x372/0x920 net/core/sock.c:1113
Code: 48 44 8b 20 44 89 74 24 54 45 01 f4 48 8b 44 24 60 42 80 3c 28 00 74 08 48 89 df e8 e8 5a c9 f8 4c 8b 33 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 74 08 4c 89 f7 e8 cf 5a c9 f8 4d 8b 3e 4c 89 f8 48
RSP: 0018:ffffc90002a1fac0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88810a8ab710 RCX: 1ffff11023002f45
RDX: ffff88801b339cc0 RSI: 0000000000002000 RDI: 0000000000000000
RBP: ffffc90002a1fc50 R08: ffffc90002a1fbdf R09: 0000000000000000
R10: ffffc90002a1fb60 R11: fffff52000543f7c R12: 0000000000f07000
R13: dffffc0000000000 R14: 0000000000000000 R15: ffff88810a8ab710
FS: 000055556f85a500(0000) GS:ffff8881a3c3d000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002000000a2000 CR3: 0000000024516000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
0: 48 rex.W
1: 44 8b 20 mov (%rax),%r12d
4: 44 89 74 24 54 mov %r14d,0x54(%rsp)
9: 45 01 f4 add %r14d,%r12d
c: 48 8b 44 24 60 mov 0x60(%rsp),%rax
11: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1)
16: 74 08 je 0x20
18: 48 89 df mov %rbx,%rdi
1b: e8 e8 5a c9 f8 call 0xf8c95b08
20: 4c 8b 33 mov (%rbx),%r14
23: 4c 89 f0 mov %r14,%rax
26: 48 c1 e8 03 shr $0x3,%rax
* 2a: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1) <-- trapping instruction
2f: 74 08 je 0x39
31: 4c 89 f7 mov %r14,%rdi
34: e8 cf 5a c9 f8 call 0xf8c95b08
39: 4d 8b 3e mov (%r14),%r15
3c: 4c 89 f8 mov %r15,%rax
3f: 48 rex.W
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.