public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode
@ 2026-04-15  8:26 Jason Xing
  2026-04-15  8:26 ` [PATCH RFC net-next v4 01/14] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
                   ` (13 more replies)
  0 siblings, 14 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Greetings, everyone. This is the batch feature series. Even though
net-next is closed, I would appreciate any feedback and suggestions
on it! Many thanks!

Bottom line up front: it stably improves performance by 88.2%.

# Background
This series focuses on improving the performance of copy mode. As
observed on physical servers, there is much room left to speed up
transmission in copy mode compared to zerocopy mode.

Even though zerocopy achieves much better performance, some limitations
remain, especially for the virtio and veth cases, due to the
implementation on the host. In the real world, hundreds of thousands of
hosts, such as those at Tencent, still don't support zerocopy mode for
VMs, so copy mode is the only option we can resort to. Being generic is
its strong advantage.

Zerocopy mode has a helpful function, xskq_cons_read_desc_batch(), which
reads descriptors in batches and then sends them out at one time, rather
than reading and sending descriptors one by one in a loop. Similar
batching ideas can be seen in classic mechanisms like GSO/GRO, which
also try to handle as many packets as they can at one time. The
motivation and idea of this series actually originated from them.

# AF_PACKET Comparison
Looking back at the initial design and implementation of AF_XDP, it's
not hard to see that the big difference it made was speeding up
transmission when zerocopy mode is enabled. So the conclusion is that
zerocopy mode of AF_XDP outperforms AF_PACKET, which still uses copy
mode. As for the copy-mode logic of the two, they look quite similar,
especially when an application using AF_PACKET sets the
PACKET_QDISC_BYPASS option. Digging into the details of AF_PACKET, we
find its implementation comparatively heavy, which is also borne out by
the real test shown below: the AF_PACKET numbers are a little bit
lower.

# Batch Mode
At the moment, I consider copy mode of AF_XDP a half-bypass mechanism,
to some extent, compared with well-known bypass mechanisms like DPDK.
To avoid as much in-kernel overhead as possible, batch xmit is proposed
to aggregate descriptors into small groups and then
read/allocate/build/send them in separate loops.

Applications are allowed to use setsockopt to enlarge the default value.
Please note that since memory allocation can be time-consuming and
heavy when memory is short and complicated memory reclaim kicks in, it
might not be good to hold one descriptor for too long, which would
bring high latency for a single skb.

# Experiments
Tested on ixgbe at 10Gb/sec with the following settings:
1. mitigations off
2. ethtool -G enp2s0f1 tx 512
3. sysctl -w net.core.skb_defer_max=0
4. sysctl -w net.core.wmem_max=21299200 and sndbuf is the same value
5. XDP_MAX_TX_SKB_BUDGET 512

taskset -c 1 ./xdpsock -i enp2s0f1 -t  -S -s 64

copy mode(before):          1,801,007 pps (baseline)
AF_PACKET:                  1,375,808 pps (-23.6%)
zc mode:                   13,333,593 pps (+640.3%)
batch mode(batch 1):        1,976,821 pps (+9.8%)
batch mode(batch 64):       3,389,704 pps (+88.2%)
batch mode(batch 256):      3,387,563 pps (+88.0%)

---
RFC v4
Link: https://lore.kernel.org/all/20251021131209.41491-1-kerneljasonxing@gmail.com/
1. fix a few bugs in v3
2. add a few optimizations
The series is built on top of commit 2ce8a41113ed ("net: hsr: emit
notification for PRP slave2 changed hw addr on port deletion"). Since
there are many changes compared to v3, please review the series from
scratch. Thanks!

v3
Link: https://lore.kernel.org/all/20250825135342.53110-1-kerneljasonxing@gmail.com/
1. I retested and got different numbers. The previous test was not
quite right because my env has two NUMA nodes and only the first one
runs at full speed.
2. To achieve a stable performance result, the development and
evaluation were also done on physical servers, matching the numbers
that I share.
3. Don't use pool->tx_descs because sockets can share the same umem
pool.
4. Use an skb list to chain the allocated and built skbs to send.
5. Add AF_PACKET test numbers.

V2
Link: https://lore.kernel.org/all/20250811131236.56206-1-kerneljasonxing@gmail.com/
1. add xmit.more sub-feature (Jesper)
2. add kmem_cache_alloc_bulk (Jesper and Maciej)

Jason Xing (14):
  xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
  xsk: extend xsk_build_skb() to support passing an already allocated
    skb
  xsk: add xsk_alloc_batch_skb() to build skbs in batch
  xsk: cache data buffers to avoid frequently calling kmalloc_reserve
  xsk: add direct xmit in batch function
  xsk: support dynamic xmit.more control for batch xmit
  xsk: try to skip validating skb list in xmit path
  xsk: rename nb_pkts to nb_descs in xsk_tx_peek_release_desc_batch
  xsk: extend xskq_cons_read_desc_batch to count nb_pkts
  xsk: extend xsk_cq_reserve_locked() to reserve n slots
  xsk: support batch xmit main logic
  xsk: separate read-mostly and write-heavy fields in xsk_buff_pool
  xsk: retire old xmit path in copy mode
  xsk: optimize xsk_build_skb for batch copy-mode fast path

 Documentation/networking/af_xdp.rst |  17 ++
 include/net/xdp_sock.h              |  17 ++
 include/net/xsk_buff_pool.h         |  10 +-
 include/uapi/linux/if_xdp.h         |   1 +
 net/core/dev.c                      |  49 +++++
 net/core/skbuff.c                   | 152 +++++++++++++++
 net/xdp/xsk.c                       | 279 ++++++++++++++++++++--------
 net/xdp/xsk_queue.h                 |  40 +++-
 tools/include/uapi/linux/if_xdp.h   |   1 +
 9 files changed, 473 insertions(+), 93 deletions(-)

-- 
2.41.3


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 01/14] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  8:51   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 02/14] xsk: extend xsk_build_skb() to support passing an already allocated skb Jason Xing
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Add a new socket option to provide an alternative way to achieve higher
overall throughput with the rest of the series applied. As the
corresponding documentation I added says, it might increase latency
because the heavy allocation cannot be avoided, especially when memory
is short. So this patch doesn't turn the feature on by default.

Add generic_xmit_batch to determine how many descriptors are handled
at one time. It shouldn't be larger than max_tx_budget or smaller than
one, which is the default value (batch mode disabled).

Introduce skb_cache, set via setsockopt under xs->mutex protection, to
store newly allocated skbs at one time.

Introduce desc_cache to temporarily cache the descriptors the xsk is
about to send each round.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 Documentation/networking/af_xdp.rst | 17 +++++++++++
 include/net/xdp_sock.h              |  7 +++++
 include/uapi/linux/if_xdp.h         |  1 +
 net/xdp/xsk.c                       | 47 +++++++++++++++++++++++++++++
 tools/include/uapi/linux/if_xdp.h   |  1 +
 5 files changed, 73 insertions(+)

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 50d92084a49c..7a8d219efe71 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -447,6 +447,23 @@ mode to allow application to tune the per-socket maximum iteration for
 better throughput and less frequency of send syscall.
 Allowed range is [32, xs->tx->nentries].
 
+XDP_GENERIC_XMIT_BATCH
+----------------------
+
+It provides an option that allows applications to use batch xmit in copy
+mode. The batch process tries to allocate a certain number of skbs through
+a bulk mechanism first, then initializes them and finally sends them out
+at one time.
+It applies efficient bulk allocation/deallocation functions and avoids
+frequently grabbing/releasing a few locks (like the cache lock and queue
+lock), minimizing IRQs triggered from the driver side, which generally
+gains an overall performance improvement as observed with the xdpsock
+benchmark.
+A potential side effect is increased per-packet latency due to memory
+allocation that is unavoidable and time-consuming.
+Setting a relatively large batch size could benefit scenarios like bulk
+transmission. The maximum value shouldn't be larger than xs->max_tx_budget.
+
 XDP_STATISTICS getsockopt
 -------------------------
 
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 23e8861e8b25..965cab9a0465 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -45,6 +45,12 @@ struct xsk_map {
 	struct xdp_sock __rcu *xsk_map[];
 };
 
+struct xsk_batch {
+	u32 generic_xmit_batch;
+	struct sk_buff **skb_cache;
+	struct xdp_desc *desc_cache;
+};
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
@@ -89,6 +95,7 @@ struct xdp_sock {
 	struct mutex mutex;
 	struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
 	struct xsk_queue *cq_tmp; /* Only as tmp storage before bind */
+	struct xsk_batch batch;
 };
 
 /*
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 23a062781468..44cb72cd328e 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -80,6 +80,7 @@ struct xdp_mmap_offsets {
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
 #define XDP_MAX_TX_SKB_BUDGET		9
+#define XDP_GENERIC_XMIT_BATCH		10
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6149f6a79897..6122db8606fe 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1218,6 +1218,16 @@ static void xsk_delete_from_maps(struct xdp_sock *xs)
 	}
 }
 
+static void xsk_batch_reset(struct xsk_batch *batch, struct sk_buff **skbs,
+			    struct xdp_desc *descs, unsigned int size)
+{
+	kfree(batch->skb_cache);
+	kvfree(batch->desc_cache);
+	batch->skb_cache = skbs;
+	batch->desc_cache = descs;
+	batch->generic_xmit_batch = size;
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -1247,6 +1257,7 @@ static int xsk_release(struct socket *sock)
 	xskq_destroy(xs->tx);
 	xskq_destroy(xs->fq_tmp);
 	xskq_destroy(xs->cq_tmp);
+	xsk_batch_reset(&xs->batch, NULL, NULL, 0);
 
 	sock_orphan(sk);
 	sock->sk = NULL;
@@ -1588,6 +1599,42 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		WRITE_ONCE(xs->max_tx_budget, budget);
 		return 0;
 	}
+	case XDP_GENERIC_XMIT_BATCH:
+	{
+		struct xsk_buff_pool *pool = xs->pool;
+		struct xsk_batch *batch = &xs->batch;
+		struct xdp_desc *descs;
+		struct sk_buff **skbs;
+		unsigned int size;
+		int ret = 0;
+
+		if (optlen != sizeof(size))
+			return -EINVAL;
+		if (copy_from_sockptr(&size, optval, sizeof(size)))
+			return -EFAULT;
+		if (size == batch->generic_xmit_batch)
+			return 0;
+		if (!size || size > xs->max_tx_budget || !pool)
+			return -EACCES;
+
+		mutex_lock(&xs->mutex);
+		skbs = kmalloc(size * sizeof(struct sk_buff *), GFP_KERNEL);
+		if (!skbs) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		descs = kvcalloc(size, sizeof(struct xdp_desc), GFP_KERNEL);
+		if (!descs) {
+			kfree(skbs);
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		xsk_batch_reset(batch, skbs, descs, size);
+out:
+		mutex_unlock(&xs->mutex);
+		return ret;
+	}
 	default:
 		break;
 	}
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 23a062781468..44cb72cd328e 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -80,6 +80,7 @@ struct xdp_mmap_offsets {
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
 #define XDP_MAX_TX_SKB_BUDGET		9
+#define XDP_GENERIC_XMIT_BATCH		10
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 02/14] xsk: extend xsk_build_skb() to support passing an already allocated skb
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
  2026-04-15  8:26 ` [PATCH RFC net-next v4 01/14] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  8:52   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 03/14] xsk: add xsk_alloc_batch_skb() to build skbs in batch Jason Xing
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

To avoid reinventing the wheel, this patch provides a way for the batch
feature to reuse xsk_build_skb() for the rest of the initialization
process after the skb is allocated.

The original xsk_build_skb() allocates a new skb by calling
sock_alloc_send_skb(), whether in copy mode or zerocopy mode. Add a new
'allocated_skb' parameter so that other callers can pass an already
allocated skb, to support the later batch xmit feature. That feature
replaces the per-skb memory allocation with a bulk one.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xdp_sock.h |  3 +++
 net/xdp/xsk.c          | 23 ++++++++++++++++-------
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 965cab9a0465..90c709fd1239 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -133,6 +133,9 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
 void __xsk_map_flush(struct list_head *flush_list);
 INDIRECT_CALLABLE_DECLARE(void xsk_destruct_skb(struct sk_buff *));
+struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
+			      struct sk_buff *allocated_skb,
+			      struct xdp_desc *desc);
 
 /**
  *  xsk_tx_metadata_to_compl - Save enough relevant metadata information
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6122db8606fe..ecd5b9c424b8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -718,6 +718,7 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
 }
 
 static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
+					      struct sk_buff *allocated_skb,
 					      struct xdp_desc *desc)
 {
 	struct xsk_buff_pool *pool = xs->pool;
@@ -734,7 +735,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 	if (!skb) {
 		hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
 
-		skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
+		if (!allocated_skb)
+			skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
+		else
+			skb = allocated_skb;
 		if (unlikely(!skb))
 			return ERR_PTR(err);
 
@@ -799,15 +803,16 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 	return skb;
 }
 
-static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
-				     struct xdp_desc *desc)
+struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
+			      struct sk_buff *allocated_skb,
+			      struct xdp_desc *desc)
 {
 	struct net_device *dev = xs->dev;
 	struct sk_buff *skb = xs->skb;
 	int err;
 
 	if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
-		skb = xsk_build_skb_zerocopy(xs, desc);
+		skb = xsk_build_skb_zerocopy(xs, allocated_skb, desc);
 		if (IS_ERR(skb)) {
 			err = PTR_ERR(skb);
 			skb = NULL;
@@ -822,8 +827,12 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 
 		if (!skb) {
 			hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
-			tr = dev->needed_tailroom;
-			skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
+			if (!allocated_skb) {
+				tr = dev->needed_tailroom;
+				skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
+			} else {
+				skb = allocated_skb;
+			}
 			if (unlikely(!skb))
 				goto free_err;
 
@@ -943,7 +952,7 @@ static int __xsk_generic_xmit(struct sock *sk)
 			goto out;
 		}
 
-		skb = xsk_build_skb(xs, &desc);
+		skb = xsk_build_skb(xs, NULL, &desc);
 		if (IS_ERR(skb)) {
 			err = PTR_ERR(skb);
 			if (err != -EOVERFLOW)
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 03/14] xsk: add xsk_alloc_batch_skb() to build skbs in batch
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
  2026-04-15  8:26 ` [PATCH RFC net-next v4 01/14] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
  2026-04-15  8:26 ` [PATCH RFC net-next v4 02/14] xsk: extend xsk_build_skb() to support passing an already allocated skb Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  9:17   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 04/14] xsk: cache data buffers to avoid frequently calling kmalloc_reserve Jason Xing
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Support allocating and building skbs in batch.

There are three steps in one batched allocation:
1. Reserve the skb and count skb->truesize. This provides a way for a
   later patch to speed up small-data transmission by diminishing the
   impact of kmalloc_reserve().
2. Add the total truesize to sk_wmem_alloc at one time. The load and
   store of sk_wmem_alloc are time-consuming, so batching them gains a
   performance improvement.
3. Copy the data and then finish initializing each skb.

This patch uses kmem_cache_alloc_bulk() to complete the batch
allocation, relying on the global common cache
'net_hotdata.skbuff_cache'. Use an xsk standalone skb cache (namely,
xs->batch.skb_cache) to store the allocated skbs instead of resorting
to napi_alloc_cache, which was designed for softirq context.

After allocating memory for the skbs, in a 'for' loop, the patch
borrows part of __alloc_skb() to initialize each skb and then calls
xsk_build_skb() to complete the rest of the initialization, like
copying the data. To achieve a better result, the allocation function
only uses what we need, keeping it lean; for instance, skb_set_owner_w()
is simplified into two lines of code.

Add batch.send_queue and use skb->list to chain the skbs so that they
can be sent easily, as shown in subsequent patches.

As for freeing skbs, napi_consume_skb() in the tx completion path puts
the skb back into the global cache 'net_hotdata.skbuff_cache', which
implements deferred skb freeing to avoid releasing skbs one by one,
improving performance.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xdp_sock.h |   3 +
 net/core/skbuff.c      | 121 +++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk.c          |   7 +++
 3 files changed, 131 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 90c709fd1239..84f0aee3fb10 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -47,8 +47,10 @@ struct xsk_map {
 
 struct xsk_batch {
 	u32 generic_xmit_batch;
+	unsigned int skb_count;
 	struct sk_buff **skb_cache;
 	struct xdp_desc *desc_cache;
+	struct sk_buff_head send_queue;
 };
 
 struct xdp_sock {
@@ -136,6 +138,7 @@ INDIRECT_CALLABLE_DECLARE(void xsk_destruct_skb(struct sk_buff *));
 struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			      struct sk_buff *allocated_skb,
 			      struct xdp_desc *desc);
+int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err);
 
 /**
  *  xsk_tx_metadata_to_compl - Save enough relevant metadata information
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4045d7c484a1..f29cecacd8bb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -83,6 +83,7 @@
 #include <net/psp/types.h>
 #include <net/dropreason.h>
 #include <net/xdp_sock.h>
+#include <net/xsk_buff_pool.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -647,6 +648,126 @@ static void *kmalloc_reserve(unsigned int *size, gfp_t flags, int node,
 	return obj;
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err)
+{
+	struct xsk_batch *batch = &xs->batch;
+	struct xdp_desc *descs = batch->desc_cache;
+	struct sk_buff **skbs = batch->skb_cache;
+	u32 alloc_descs, base_len, wmem, sndbuf;
+	gfp_t gfp_mask = xs->sk.sk_allocation;
+	u32 skb_count = batch->skb_count;
+	struct net_device *dev = xs->dev;
+	unsigned int total_truesize = 0;
+	struct sk_buff *skb = NULL;
+	int node = NUMA_NO_NODE;
+	u32 i = 0, j, k = 0;
+	bool need_alloc;
+	u8 *data;
+
+	base_len = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
+	if (!(dev->priv_flags & IFF_TX_SKB_NO_LINEAR))
+		base_len += dev->needed_tailroom;
+
+	if (xs->skb)
+		nb_pkts--;
+
+	if (skb_count >= nb_pkts)
+		goto alloc_data;
+
+	skb_count += kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
+					   gfp_mask,
+					   nb_pkts - skb_count,
+					   (void **)&skbs[skb_count]);
+	if (skb_count < nb_pkts)
+		nb_pkts = skb_count;
+
+alloc_data:
+	/*
+	 * Phase 1: Allocate data buffers and initialize SKBs.
+	 * Pre-scan descriptors to determine packet boundaries, so we can
+	 * batch the sk_wmem_alloc charge in Phase 2.
+	 */
+	need_alloc = !xs->skb;
+	wmem = sk_wmem_alloc_get(&xs->sk);
+	sndbuf = READ_ONCE(xs->sk.sk_sndbuf);
+	for (j = 0; j < nb_descs; j++) {
+		if (need_alloc) {
+			u32 size = base_len;
+
+			if (!(dev->priv_flags & IFF_TX_SKB_NO_LINEAR))
+				size += descs[j].len;
+
+			if (i >= nb_pkts) {
+				*err = -EAGAIN;
+				break;
+			}
+
+			if (wmem + size + total_truesize > sndbuf) {
+				*err = -EAGAIN;
+				break;
+			}
+
+			skb = skbs[skb_count - 1 - i];
+			skbuff_clear(skb);
+			data = kmalloc_reserve(&size, gfp_mask, node, skb);
+			if (unlikely(!data)) {
+				*err = -ENOBUFS;
+				break;
+			}
+			__finalize_skb_around(skb, data, size);
+			/* Replace skb_set_owner_w() with the following */
+			skb->sk = &xs->sk;
+			skb->destructor = sock_wfree;
+			total_truesize += skb->truesize;
+			i++;
+			need_alloc = false;
+		}
+		if (!xp_mb_desc(&descs[j]))
+			need_alloc = true;
+	}
+	alloc_descs = j;
+
+	/*
+	 * Phase 2: Batch charge sk_wmem_alloc.
+	 * One refcount_add() replaces N per-SKB skb_set_owner_w() calls,
+	 * which gains much performance improvement.
+	 */
+	if (total_truesize)
+		refcount_add(total_truesize, &xs->sk.sk_wmem_alloc);
+
+	/* Phase 3: Build SKBs with packet data */
+	for (j = 0; j < alloc_descs; j++) {
+		if (!xs->skb) {
+			skb = skbs[skb_count - 1 - k];
+			k++;
+		}
+
+		skb = xsk_build_skb(xs, skb, &descs[j]);
+		if (IS_ERR(skb)) {
+			*err = PTR_ERR(skb);
+			break;
+		}
+
+		if (xp_mb_desc(&descs[j])) {
+			xs->skb = skb;
+			continue;
+		}
+
+		xs->skb = NULL;
+		__skb_queue_tail(&batch->send_queue, skb);
+	}
+
+	/* Phase 4: Reclaim unused allocated SKBs */
+	while (k < i)
+		kfree_skb(skbs[skb_count - 1 - k++]);
+
+	batch->skb_count = skb_count - i;
+
+	return j;
+}
+#endif
+
 /* 	Allocate a new skbuff. We do this ourselves so we can fill in a few
  *	'private' fields and also do memory statistics to find all the
  *	[BEEP] leaks.
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ecd5b9c424b8..f97bc9cf9b9a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -25,6 +25,7 @@
 #include <linux/vmalloc.h>
 #include <net/xdp_sock_drv.h>
 #include <net/busy_poll.h>
+#include <net/hotdata.h>
 #include <net/netdev_lock.h>
 #include <net/netdev_rx_queue.h>
 #include <net/xdp.h>
@@ -1230,10 +1231,15 @@ static void xsk_delete_from_maps(struct xdp_sock *xs)
 static void xsk_batch_reset(struct xsk_batch *batch, struct sk_buff **skbs,
 			    struct xdp_desc *descs, unsigned int size)
 {
+	if (batch->skb_count)
+		kmem_cache_free_bulk(net_hotdata.skbuff_cache,
+				     batch->skb_count,
+				     (void **)batch->skb_cache);
 	kfree(batch->skb_cache);
 	kvfree(batch->desc_cache);
 	batch->skb_cache = skbs;
 	batch->desc_cache = descs;
+	batch->skb_count = 0;
 	batch->generic_xmit_batch = size;
 }
 
@@ -1946,6 +1952,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 
 	INIT_LIST_HEAD(&xs->map_list);
 	spin_lock_init(&xs->map_list_lock);
+	__skb_queue_head_init(&xs->batch.send_queue);
 
 	mutex_lock(&net->xdp.lock);
 	sk_add_node_rcu(sk, &net->xdp.list);
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 04/14] xsk: cache data buffers to avoid frequently calling kmalloc_reserve
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (2 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 03/14] xsk: add xsk_alloc_batch_skb() to build skbs in batch Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  9:38   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 05/14] xsk: add direct xmit in batch function Jason Xing
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This is beneficial for small-data transmission.

Replace the per-SKB kmalloc_reserve() with on-demand bulk allocation
from skb_small_head_cache for small packets.

Add a persistent per-socket data buffer cache (batch.data_cache /
batch.data_count) that survives across batch cycles, similar to how
batch.send_queue caches built SKBs.

Inside the Phase-1 per-descriptor loop, when a small packet needs a
data buffer and the cache is empty, a single kmem_cache_alloc_bulk()
refills it with generic_xmit_batch objects. Subsequent small packets
pop directly from the cache. Large packets bypass the cache entirely
and fall back to kmalloc_reserve(). Unused buffers remain in the
cache for the next batch.

I observed that kmalloc_reserve() consumes nearly 40% of the cycles,
which seemed unavoidable at first glance; adding the bulk mechanism
should therefore contribute to performance. That is the motivation for
this patch. Now the feature gives us around a 10% improvement.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xdp_sock.h |  2 ++
 net/core/skbuff.c      | 27 ++++++++++++++++++++++-----
 net/xdp/xsk.c          | 24 ++++++++++++++++++++----
 3 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 84f0aee3fb10..2151aab8f0a1 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -51,6 +51,8 @@ struct xsk_batch {
 	struct sk_buff **skb_cache;
 	struct xdp_desc *desc_cache;
 	struct sk_buff_head send_queue;
+	unsigned int data_count;
+	void **data_cache;
 };
 
 struct xdp_sock {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f29cecacd8bb..5726b1566b2b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -661,9 +661,11 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
 	unsigned int total_truesize = 0;
 	struct sk_buff *skb = NULL;
 	int node = NUMA_NO_NODE;
+	void **dc = batch->data_cache;
+	unsigned int dc_count = batch->data_count;
 	u32 i = 0, j, k = 0;
 	bool need_alloc;
-	u8 *data;
+	void *data;
 
 	base_len = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
 	if (!(dev->priv_flags & IFF_TX_SKB_NO_LINEAR))
@@ -683,6 +685,13 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
 		nb_pkts = skb_count;
 
 alloc_data:
+	if (dc_count < nb_pkts && !(gfp_mask & KMALLOC_NOT_NORMAL_BITS))
+		dc_count += kmem_cache_alloc_bulk(
+				net_hotdata.skb_small_head_cache,
+				gfp_mask | __GFP_NOMEMALLOC | __GFP_NOWARN,
+				batch->generic_xmit_batch - dc_count,
+				&dc[dc_count]);
+
 	/*
 	 * Phase 1: Allocate data buffers and initialize SKBs.
 	 * Pre-scan descriptors to determine packet boundaries, so we can
@@ -710,10 +719,17 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
 
 			skb = skbs[skb_count - 1 - i];
 			skbuff_clear(skb);
-			data = kmalloc_reserve(&size, gfp_mask, node, skb);
-			if (unlikely(!data)) {
-				*err = -ENOBUFS;
-				break;
+			if (dc_count &&
+			    SKB_HEAD_ALIGN(size) <= SKB_SMALL_HEAD_CACHE_SIZE) {
+				data = dc[--dc_count];
+				size = SKB_SMALL_HEAD_CACHE_SIZE;
+			} else {
+				data = kmalloc_reserve(&size, gfp_mask,
+						       node, skb);
+				if (unlikely(!data)) {
+					*err = -ENOBUFS;
+					break;
+				}
 			}
 			__finalize_skb_around(skb, data, size);
 			/* Replace skb_set_owner_w() with the following */
@@ -762,6 +778,7 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
 	while (k < i)
 		kfree_skb(skbs[skb_count - 1 - k++]);
 
+	batch->data_count = dc_count;
 	batch->skb_count = skb_count - i;
 
 	return j;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f97bc9cf9b9a..7a6991bc19a8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1229,14 +1229,22 @@ static void xsk_delete_from_maps(struct xdp_sock *xs)
 }
 
 static void xsk_batch_reset(struct xsk_batch *batch, struct sk_buff **skbs,
-			    struct xdp_desc *descs, unsigned int size)
-{
+			    struct xdp_desc *descs, void **data,
+			    unsigned int size)
+{
+	if (batch->data_count)
+		kmem_cache_free_bulk(net_hotdata.skb_small_head_cache,
+				     batch->data_count,
+				     batch->data_cache);
+	kfree(batch->data_cache);
 	if (batch->skb_count)
 		kmem_cache_free_bulk(net_hotdata.skbuff_cache,
 				     batch->skb_count,
 				     (void **)batch->skb_cache);
 	kfree(batch->skb_cache);
 	kvfree(batch->desc_cache);
+	batch->data_cache = data;
+	batch->data_count = 0;
 	batch->skb_cache = skbs;
 	batch->desc_cache = descs;
 	batch->skb_count = 0;
@@ -1272,7 +1280,7 @@ static int xsk_release(struct socket *sock)
 	xskq_destroy(xs->tx);
 	xskq_destroy(xs->fq_tmp);
 	xskq_destroy(xs->cq_tmp);
-	xsk_batch_reset(&xs->batch, NULL, NULL, 0);
+	xsk_batch_reset(&xs->batch, NULL, NULL, NULL, 0);
 
 	sock_orphan(sk);
 	sock->sk = NULL;
@@ -1620,6 +1628,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		struct xsk_batch *batch = &xs->batch;
 		struct xdp_desc *descs;
 		struct sk_buff **skbs;
+		void **data;
 		unsigned int size;
 		int ret = 0;
 
@@ -1638,14 +1647,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			ret = -ENOMEM;
 			goto out;
 		}
+		data = kmalloc_array(size, sizeof(void *), GFP_KERNEL);
+		if (!data) {
+			kfree(skbs);
+			ret = -ENOMEM;
+			goto out;
+		}
 		descs = kvcalloc(size, sizeof(struct xdp_desc), GFP_KERNEL);
 		if (!descs) {
+			kfree(data);
 			kfree(skbs);
 			ret = -ENOMEM;
 			goto out;
 		}
 
-		xsk_batch_reset(batch, skbs, descs, size);
+		xsk_batch_reset(batch, skbs, descs, data, size);
 out:
 		mutex_unlock(&xs->mutex);
 		return ret;
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 05/14] xsk: add direct xmit in batch function
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (3 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 04/14] xsk: cache data buffers to avoid frequently calling kmalloc_reserve Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  9:11   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 06/14] xsk: support dynamic xmit.more control for batch xmit Jason Xing
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Add batch xmit logic.

Grab the lock and disable bottom halves only once, then send all
the aggregated packets in one loop. Via skb->list, the already-built
skbs can be handled one by one.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xdp_sock.h |  1 +
 net/core/dev.c         | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 2151aab8f0a1..0609e3b04279 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -141,6 +141,7 @@ struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			      struct sk_buff *allocated_skb,
 			      struct xdp_desc *desc);
 int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err);
+int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev);
 
 /**
  *  xsk_tx_metadata_to_compl - Save enough relevant metadata information
diff --git a/net/core/dev.c b/net/core/dev.c
index 4519f0e59beb..e33a2406d8ca 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -163,6 +163,7 @@
 #include <net/page_pool/memory_provider.h>
 #include <net/rps.h>
 #include <linux/phy_link_topology.h>
+#include <net/xdp_sock.h>
 
 #include "dev.h"
 #include "devmem.h"
@@ -4893,6 +4894,46 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 }
 EXPORT_SYMBOL(__dev_queue_xmit);
 
+int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
+{
+	u16 queue_id = xs->queue_id;
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, queue_id);
+	struct sk_buff_head *send_queue = &xs->batch.send_queue;
+	int ret = NETDEV_TX_BUSY;
+	struct sk_buff *skb;
+
+	local_bh_disable();
+	HARD_TX_LOCK(dev, txq, smp_processor_id());
+	while ((skb = __skb_dequeue(send_queue)) != NULL) {
+		struct sk_buff *orig_skb = skb;
+		bool again = false;
+
+		skb = validate_xmit_skb_list(skb, dev, &again);
+		if (skb != orig_skb) {
+			dev_core_stats_tx_dropped_inc(dev);
+			kfree_skb_list(skb);
+			ret = NET_XMIT_DROP;
+			break;
+		}
+
+		if (netif_xmit_frozen_or_drv_stopped(txq)) {
+			__skb_queue_head(send_queue, skb);
+			break;
+		}
+		skb_set_queue_mapping(skb, queue_id);
+		ret = netdev_start_xmit(skb, dev, txq, false);
+		if (ret != NETDEV_TX_OK) {
+			if (ret == NETDEV_TX_BUSY)
+				__skb_queue_head(send_queue, skb);
+			break;
+		}
+	}
+	HARD_TX_UNLOCK(dev, txq);
+	local_bh_enable();
+
+	return ret;
+}
+
 int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
 {
 	struct net_device *dev = skb->dev;
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 06/14] xsk: support dynamic xmit.more control for batch xmit
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (4 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 05/14] xsk: add direct xmit in batch function Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  9:35   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 07/14] xsk: try to skip validating skb list in xmit path Jason Xing
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Only set xmit.more false for the last skb.

In theory, setting xmit.more to false only for the last packet sent
in each round brings a large benefit, such as avoiding triggering too
many IRQs.

Compared to the numbers for batch mode, a huge improvement (26%) can
be seen on the i40e/ixgbe drivers since triggering IRQs is expensive.

Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/core/dev.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index e33a2406d8ca..a6abd621a7f3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4901,6 +4901,7 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
 	struct sk_buff_head *send_queue = &xs->batch.send_queue;
 	int ret = NETDEV_TX_BUSY;
 	struct sk_buff *skb;
+	bool more = true;
 
 	local_bh_disable();
 	HARD_TX_LOCK(dev, txq, smp_processor_id());
@@ -4920,8 +4921,12 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
 			__skb_queue_head(send_queue, skb);
 			break;
 		}
+
+		if (!skb_peek(send_queue))
+			more = false;
+
 		skb_set_queue_mapping(skb, queue_id);
-		ret = netdev_start_xmit(skb, dev, txq, false);
+		ret = netdev_start_xmit(skb, dev, txq, more);
 		if (ret != NETDEV_TX_OK) {
 			if (ret == NETDEV_TX_BUSY)
 				__skb_queue_head(send_queue, skb);
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 07/14] xsk: try to skip validating skb list in xmit path
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (5 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 06/14] xsk: support dynamic xmit.more control for batch xmit Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  9:33   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 08/14] xsk: rename nb_pkts to nb_descs in xsk_tx_peek_release_desc_batch Jason Xing
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This patch moves the SG check ahead; it is the only check that needs
careful handling, because nr_frags (in skb_needs_linearize()) is used
both in xsk_build_skb_zerocopy() and in multi-buffer mode[1].

In most cases, xsk does not need to validate the skb in
validate_xmit_skb_list(), which adds numerous checks to an extremely
hot path. In this kind of workload, even the overhead of simple
arithmetic is non-trivial.

Performance-wise, I ran './xdpsock -i enp2s0f0np0 -t  -S -s 64' on a
1Gb/s ixgbe NIC to verify. Throughput stably goes up by 5.48%.

[1]: https://lore.kernel.org/all/20251125115754.46793-1-kerneljasonxing@gmail.com/

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/core/dev.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index a6abd621a7f3..aa38993b9dd4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4899,6 +4899,7 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
 	u16 queue_id = xs->queue_id;
 	struct netdev_queue *txq = netdev_get_tx_queue(dev, queue_id);
 	struct sk_buff_head *send_queue = &xs->batch.send_queue;
+	bool need_validate = !(dev->features & NETIF_F_SG);
 	int ret = NETDEV_TX_BUSY;
 	struct sk_buff *skb;
 	bool more = true;
@@ -4906,15 +4907,17 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
 	local_bh_disable();
 	HARD_TX_LOCK(dev, txq, smp_processor_id());
 	while ((skb = __skb_dequeue(send_queue)) != NULL) {
-		struct sk_buff *orig_skb = skb;
-		bool again = false;
-
-		skb = validate_xmit_skb_list(skb, dev, &again);
-		if (skb != orig_skb) {
-			dev_core_stats_tx_dropped_inc(dev);
-			kfree_skb_list(skb);
-			ret = NET_XMIT_DROP;
-			break;
+		if (unlikely(need_validate)) {
+			struct sk_buff *orig_skb = skb;
+			bool again = false;
+
+			skb = validate_xmit_skb_list(skb, dev, &again);
+			if (skb != orig_skb) {
+				dev_core_stats_tx_dropped_inc(dev);
+				kfree_skb_list(skb);
+				ret = NET_XMIT_DROP;
+				break;
+			}
 		}
 
 		if (netif_xmit_frozen_or_drv_stopped(txq)) {
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 08/14] xsk: rename nb_pkts to nb_descs in xsk_tx_peek_release_desc_batch
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (6 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 07/14] xsk: try to skip validating skb list in xmit path Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  8:26 ` [PATCH RFC net-next v4 09/14] xsk: extend xskq_cons_read_desc_batch to count nb_pkts Jason Xing
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Rename the last parameter to nb_descs for more accurate naming. The
next patch will add a real nb_pkts parameter to help copy mode count
how many packets are needed.

No functional change here.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 7a6991bc19a8..6cd2e58e170c 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -485,16 +485,16 @@ EXPORT_SYMBOL(xsk_tx_peek_desc);
 static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entries)
 {
 	struct xdp_desc *descs = pool->tx_descs;
-	u32 nb_pkts = 0;
+	u32 nb_descs = 0;
 
-	while (nb_pkts < max_entries && xsk_tx_peek_desc(pool, &descs[nb_pkts]))
-		nb_pkts++;
+	while (nb_descs < max_entries && xsk_tx_peek_desc(pool, &descs[nb_descs]))
+		nb_descs++;
 
 	xsk_tx_release(pool);
-	return nb_pkts;
+	return nb_descs;
 }
 
-u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
+u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_descs)
 {
 	struct xdp_sock *xs;
 
@@ -502,16 +502,16 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 	if (!list_is_singular(&pool->xsk_tx_list)) {
 		/* Fallback to the non-batched version */
 		rcu_read_unlock();
-		return xsk_tx_peek_release_fallback(pool, nb_pkts);
+		return xsk_tx_peek_release_fallback(pool, nb_descs);
 	}
 
 	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
 	if (!xs) {
-		nb_pkts = 0;
+		nb_descs = 0;
 		goto out;
 	}
 
-	nb_pkts = xskq_cons_nb_entries(xs->tx, nb_pkts);
+	nb_descs = xskq_cons_nb_entries(xs->tx, nb_descs);
 
 	/* This is the backpressure mechanism for the Tx path. Try to
 	 * reserve space in the completion queue for all packets, but
@@ -519,23 +519,23 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 	 * packets. This avoids having to implement any buffering in
 	 * the Tx path.
 	 */
-	nb_pkts = xskq_prod_nb_free(pool->cq, nb_pkts);
-	if (!nb_pkts)
+	nb_descs = xskq_prod_nb_free(pool->cq, nb_descs);
+	if (!nb_descs)
 		goto out;
 
-	nb_pkts = xskq_cons_read_desc_batch(xs->tx, pool, nb_pkts);
-	if (!nb_pkts) {
+	nb_descs = xskq_cons_read_desc_batch(xs->tx, pool, nb_descs);
+	if (!nb_descs) {
 		xs->tx->queue_empty_descs++;
 		goto out;
 	}
 
 	__xskq_cons_release(xs->tx);
-	xskq_prod_write_addr_batch(pool->cq, pool->tx_descs, nb_pkts);
+	xskq_prod_write_addr_batch(pool->cq, pool->tx_descs, nb_descs);
 	xs->sk.sk_write_space(&xs->sk);
 
 out:
 	rcu_read_unlock();
-	return nb_pkts;
+	return nb_descs;
 }
 EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
 
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 09/14] xsk: extend xskq_cons_read_desc_batch to count nb_pkts
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (7 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 08/14] xsk: rename nb_pkts to nb_descs in xsk_tx_peek_release_desc_batch Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  8:26 ` [PATCH RFC net-next v4 10/14] xsk: extend xsk_cq_reserve_locked() to reserve n slots Jason Xing
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Add a new parameter, nb_pkts, to count how many packets copy mode
actually needs, with the help of the XDP_PKT_CONTD option.

Add a descs parameter to provide a way to pass xs->desc_cache, which
stores the descriptors for copy mode.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk_queue.h | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index ec08d9c102b1..354f6fe86893 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -263,12 +263,12 @@ static inline void parse_desc(struct xsk_queue *q, struct xsk_buff_pool *pool,
 	parsed->mb = xp_mb_desc(desc);
 }
 
-static inline
-u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
-			      u32 max)
+static inline u32
+__xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
+			    struct xdp_desc *descs, u32 max, u32 *nb_pkts,
+			    u32 max_segs)
 {
 	u32 cached_cons = q->cached_cons, nb_entries = 0;
-	struct xdp_desc *descs = pool->tx_descs;
 	u32 total_descs = 0, nr_frags = 0;
 
 	/* track first entry, if stumble upon *any* invalid descriptor, rewind
@@ -288,9 +288,11 @@ u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
 		if (likely(!parsed.mb)) {
 			total_descs += (nr_frags + 1);
 			nr_frags = 0;
+			if (nb_pkts)
+				(*nb_pkts)++;
 		} else {
 			nr_frags++;
-			if (nr_frags == pool->xdp_zc_max_segs) {
+			if (nr_frags == max_segs) {
 				nr_frags = 0;
 				break;
 			}
@@ -304,6 +306,14 @@ u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
 	return total_descs;
 }
 
+static inline u32
+xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
+			  u32 max)
+{
+	return __xskq_cons_read_desc_batch(q, pool, pool->tx_descs, max,
+					   NULL, pool->xdp_zc_max_segs);
+}
+
 /* Functions for consumers */
 
 static inline void __xskq_cons_release(struct xsk_queue *q)
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 10/14] xsk: extend xsk_cq_reserve_locked() to reserve n slots
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (8 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 09/14] xsk: extend xskq_cons_read_desc_batch to count nb_pkts Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  8:26 ` [PATCH RFC net-next v4 11/14] xsk: support batch xmit main logic Jason Xing
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Previously it reserved only one slot. This patch extends it to n slots
to cover batch mode.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c       | 12 ++++++++----
 net/xdp/xsk_queue.h | 12 +++++++-----
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6cd2e58e170c..c26e26cb4dda 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -546,12 +546,17 @@ static int xsk_wakeup(struct xdp_sock *xs, u8 flags)
 	return dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id, flags);
 }
 
-static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool)
+/* The function tries to reserve as many descs as possible. If there
+ * is no single slot to allocate, return zero. Otherwise, return how
+ * many slots are available, even though it might stop reserving at
+ * certain point.
+ */
+static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool, u32 n)
 {
 	int ret;
 
 	spin_lock(&pool->cq->cq_cached_prod_lock);
-	ret = xskq_prod_reserve(pool->cq);
+	ret = xskq_prod_reserve(pool->cq, n);
 	spin_unlock(&pool->cq->cq_cached_prod_lock);
 
 	return ret;
@@ -947,8 +952,7 @@ static int __xsk_generic_xmit(struct sock *sk)
 		 * if there is space in it. This avoids having to implement
 		 * any buffering in the Tx path.
 		 */
-		err = xsk_cq_reserve_locked(xs->pool);
-		if (err) {
+		if (!xsk_cq_reserve_locked(xs->pool, 1)) {
 			err = -EAGAIN;
 			goto out;
 		}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 354f6fe86893..34cc07d6115e 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -413,14 +413,16 @@ static inline void xskq_prod_cancel_n(struct xsk_queue *q, u32 cnt)
 	q->cached_prod -= cnt;
 }
 
-static inline int xskq_prod_reserve(struct xsk_queue *q)
+static inline int xskq_prod_reserve(struct xsk_queue *q, u32 n)
 {
-	if (xskq_prod_is_full(q))
-		return -ENOSPC;
+	u32 nr_free = xskq_prod_nb_free(q, n);
+
+	if (!nr_free)
+		return 0;
 
 	/* A, matches D */
-	q->cached_prod++;
-	return 0;
+	q->cached_prod += nr_free;
+	return nr_free;
 }
 
 static inline int xskq_prod_reserve_addr(struct xsk_queue *q, u64 addr)
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 11/14] xsk: support batch xmit main logic
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (9 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 10/14] xsk: extend xsk_cq_reserve_locked() to reserve n slots Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  9:38   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 12/14] xsk: separate read-mostly and write-heavy fields in xsk_buff_pool Jason Xing
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

__xsk_generic_xmit_batch() is the core function of batch xmit: it
implements a batch version of __xsk_generic_xmit().

The whole logic is divided into sections:
1. check if we have enough available slots in tx ring and completion
   ring.
2. read descriptors from tx ring into pool->tx_descs in batches
3. reserve enough slots in completion ring to avoid backpressure
4. allocate and build skbs in batches
5. send all the possible packets in batches at one time

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c       | 116 ++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h |   8 +++
 2 files changed, 124 insertions(+)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index c26e26cb4dda..e1ad2ac2b39a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -920,6 +920,122 @@ struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 	return ERR_PTR(err);
 }
 
+static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
+{
+	struct xsk_buff_pool *pool = xs->pool;
+	struct xsk_batch *batch = &xs->batch;
+	struct xdp_desc *descs = batch->desc_cache;
+	struct net_device *dev = xs->dev;
+	u32 max_batch, max_budget;
+	bool sent_frame = false;
+	struct sk_buff *skb;
+	u32 cons_descs;
+	int err = 0;
+	u32 i = 0;
+
+	mutex_lock(&xs->mutex);
+
+	/* Since we dropped the RCU read lock, the socket state might have changed. */
+	if (unlikely(!xsk_is_bound(xs))) {
+		err = -ENXIO;
+		goto out;
+	}
+
+	if (xs->queue_id >= dev->real_num_tx_queues) {
+		err = -ENXIO;
+		goto out;
+	}
+
+	if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev))) {
+		err = -ENETDOWN;
+		goto out;
+	}
+
+	max_budget = READ_ONCE(xs->max_tx_budget);
+	max_batch = batch->generic_xmit_batch;
+
+	for (i = 0; i < max_budget; i += cons_descs) {
+		u32 nb_pkts = 0;
+		u32 nb_descs;
+
+		nb_descs = min(max_batch, max_budget - i);
+		nb_descs = xskq_cons_nb_entries(xs->tx, nb_descs);
+		if (!nb_descs)
+			goto out;
+
+		/* This is the backpressure mechanism for the Tx path. Try to
+		 * reserve space in the completion queue for all packets, but
+		 * if there are fewer slots available, just process that many
+		 * packets. This avoids having to implement any buffering in
+		 * the Tx path.
+		 */
+		nb_descs = xsk_cq_reserve_locked(pool, nb_descs);
+		if (!nb_descs) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		cons_descs = xskq_cons_read_desc_batch_copy(xs->tx, pool, descs,
+							    nb_descs, &nb_pkts);
+		if (cons_descs < nb_descs) {
+			u32 delta = nb_descs - cons_descs;
+
+			xsk_cq_cancel_locked(pool, delta);
+			xs->tx->queue_empty_descs += delta;
+			if (!cons_descs) {
+				err = -EAGAIN;
+				goto out;
+			}
+			nb_descs = cons_descs;
+		}
+
+		cons_descs = xsk_alloc_batch_skb(xs, nb_pkts, nb_descs, &err);
+		/* Return 'nb_descs - cons_descs' number of descs to the
+		 * pool if the batch allocation partially fails
+		 */
+		if (cons_descs < nb_descs) {
+			xskq_cons_cancel_n(xs->tx, nb_descs - cons_descs);
+			xsk_cq_cancel_locked(pool, nb_descs - cons_descs);
+		}
+
+		if (!skb_queue_empty(&batch->send_queue)) {
+			int err_xmit;
+
+			err_xmit = xsk_direct_xmit_batch(xs, dev);
+			if (err_xmit == NETDEV_TX_BUSY)
+				err = -EAGAIN;
+			else if (err_xmit == NET_XMIT_DROP)
+				err = -EBUSY;
+
+			sent_frame = true;
+		}
+
+		if (err)
+			goto out;
+	}
+
+	/* Maximum budget of descriptors has been consumed */
+	if (xskq_has_descs(xs->tx))
+		err = -EAGAIN;
+
+out:
+	if (xs->skb)
+		xsk_drop_skb(xs->skb);
+
+	/* If send_queue has more pending skbs, we must clear
+	 * the rest of them.
+	 */
+	while ((skb = __skb_dequeue(&batch->send_queue)) != NULL) {
+		xskq_cons_cancel_n(xs->tx, xsk_get_num_desc(skb));
+		xsk_consume_skb(skb);
+	}
+	if (sent_frame)
+		__xsk_tx_release(xs);
+
+	mutex_unlock(&xs->mutex);
+	return err;
+}
+
 static int __xsk_generic_xmit(struct sock *sk)
 {
 	struct xdp_sock *xs = xdp_sk(sk);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 34cc07d6115e..c3b97c6f2910 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -314,6 +314,14 @@ xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
 					   NULL, pool->xdp_zc_max_segs);
 }
 
+static inline u32
+xskq_cons_read_desc_batch_copy(struct xsk_queue *q, struct xsk_buff_pool *pool,
+			       struct xdp_desc *descs, u32 max, u32 *nb_pkts)
+{
+	return __xskq_cons_read_desc_batch(q, pool, descs, max,
+					   nb_pkts, MAX_SKB_FRAGS);
+}
+
 /* Functions for consumers */
 
 static inline void __xskq_cons_release(struct xsk_queue *q)
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 12/14] xsk: separate read-mostly and write-heavy fields in xsk_buff_pool
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (10 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 11/14] xsk: support batch xmit main logic Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  9:20   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 13/14] xsk: retire old xmit path in copy mode Jason Xing
  2026-04-15  8:26 ` [PATCH RFC net-next v4 14/14] xsk: optimize xsk_build_skb for batch copy-mode fast path Jason Xing
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

perf c2c profiling of the AF_XDP generic-copy batch TX path reveals
that ~45% of all cache-line contention (HITM) comes from a single
cacheline inside struct xsk_buff_pool.

The sendmsg CPU reads pool geometry fields (addrs, chunk_size,
headroom, tx_metadata_len, etc.) in the validate and build hot
path, while the NAPI TX-completion CPU writes cq_prod_lock (via
xsk_destruct_skb -> xsk_cq_submit_addr_locked) and
cached_need_wakeup (via xsk_set/clear_tx_need_wakeup) on the same
cacheline: classic false sharing.

This adds one extra cacheline (64 bytes) to the per-pool allocation
but eliminates cross-CPU false sharing between the TX sendmsg and
TX completion paths.

This reorganization improves overall performance by 5-6%, which can
be captured by xdpsock.

After this, the only remaining hotspot is the refcount handling at 6%,
which has already been batched in this series to minimize its impact.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xsk_buff_pool.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index ccb3b350001f..b1b11e3aa273 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -73,23 +73,27 @@ struct xsk_buff_pool {
 	u64 addrs_cnt;
 	u32 free_list_cnt;
 	u32 dma_pages_cnt;
-	u32 free_heads_cnt;
+
+	/* Read-mostly fields */
 	u32 headroom;
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
 	u32 xdp_zc_max_segs;
 	u8 tx_metadata_len; /* inherited from umem */
-	u8 cached_need_wakeup;
 	bool uses_need_wakeup;
 	bool unaligned;
 	bool tx_sw_csum;
 	void *addrs;
+
+	/* Write-heavy fields */
 	/* Mutual exclusion of the completion ring in the SKB mode.
 	 * Protect: NAPI TX thread and sendmsg error paths in the SKB
 	 * destructor callback.
 	 */
-	spinlock_t cq_prod_lock;
+	spinlock_t cq_prod_lock ____cacheline_aligned_in_smp;
+	u8 cached_need_wakeup;
+	u32 free_heads_cnt;
 	struct xdp_buff_xsk *free_heads[];
 };
 
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 13/14] xsk: retire old xmit path in copy mode
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (11 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 12/14] xsk: separate read-mostly and write-heavy fields in xsk_buff_pool Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  9:18   ` sashiko-bot
  2026-04-15  8:26 ` [PATCH RFC net-next v4 14/14] xsk: optimize xsk_build_skb for batch copy-mode fast path Jason Xing
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Add a new helper, xsk_init_batch(), used in xsk_create() with a
default batch size of 1.

Retire __xsk_generic_xmit().

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 151 +++++++++++++-------------------------------------
 1 file changed, 37 insertions(+), 114 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e1ad2ac2b39a..be341290e42c 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1036,101 +1036,14 @@ static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
 	return err;
 }
 
-static int __xsk_generic_xmit(struct sock *sk)
-{
-	struct xdp_sock *xs = xdp_sk(sk);
-	bool sent_frame = false;
-	struct xdp_desc desc;
-	struct sk_buff *skb;
-	u32 max_batch;
-	int err = 0;
-
-	mutex_lock(&xs->mutex);
-
-	/* Since we dropped the RCU read lock, the socket state might have changed. */
-	if (unlikely(!xsk_is_bound(xs))) {
-		err = -ENXIO;
-		goto out;
-	}
-
-	if (xs->queue_id >= xs->dev->real_num_tx_queues)
-		goto out;
-
-	max_batch = READ_ONCE(xs->max_tx_budget);
-	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-		if (max_batch-- == 0) {
-			err = -EAGAIN;
-			goto out;
-		}
-
-		/* This is the backpressure mechanism for the Tx path.
-		 * Reserve space in the completion queue and only proceed
-		 * if there is space in it. This avoids having to implement
-		 * any buffering in the Tx path.
-		 */
-		if (!xsk_cq_reserve_locked(xs->pool, 1)) {
-			err = -EAGAIN;
-			goto out;
-		}
-
-		skb = xsk_build_skb(xs, NULL, &desc);
-		if (IS_ERR(skb)) {
-			err = PTR_ERR(skb);
-			if (err != -EOVERFLOW)
-				goto out;
-			err = 0;
-			continue;
-		}
-
-		xskq_cons_release(xs->tx);
-
-		if (xp_mb_desc(&desc)) {
-			xs->skb = skb;
-			continue;
-		}
-
-		err = __dev_direct_xmit(skb, xs->queue_id);
-		if  (err == NETDEV_TX_BUSY) {
-			/* Tell user-space to retry the send */
-			xskq_cons_cancel_n(xs->tx, xsk_get_num_desc(skb));
-			xsk_consume_skb(skb);
-			err = -EAGAIN;
-			goto out;
-		}
-
-		/* Ignore NET_XMIT_CN as packet might have been sent */
-		if (err == NET_XMIT_DROP) {
-			/* SKB completed but not sent */
-			err = -EBUSY;
-			xs->skb = NULL;
-			goto out;
-		}
-
-		sent_frame = true;
-		xs->skb = NULL;
-	}
-
-	if (xskq_has_descs(xs->tx)) {
-		if (xs->skb)
-			xsk_drop_skb(xs->skb);
-		xskq_cons_release(xs->tx);
-	}
-
-out:
-	if (sent_frame)
-		__xsk_tx_release(xs);
-
-	mutex_unlock(&xs->mutex);
-	return err;
-}
-
 static int xsk_generic_xmit(struct sock *sk)
 {
+	struct xdp_sock *xs = xdp_sk(sk);
 	int ret;
 
 	/* Drop the RCU lock since the SKB path might sleep. */
 	rcu_read_unlock();
-	ret = __xsk_generic_xmit(sk);
+	ret = __xsk_generic_xmit_batch(xs);
 	/* Reaquire RCU lock before going into common code. */
 	rcu_read_lock();
 
@@ -1626,6 +1539,34 @@ struct xdp_umem_reg_v1 {
 	__u32 headroom;
 };
 
+static int xsk_init_batch(struct xsk_batch *batch, unsigned int size)
+{
+	struct xdp_desc *descs;
+	struct sk_buff **skbs;
+	void **data;
+
+	skbs = kmalloc(size * sizeof(struct sk_buff *), GFP_KERNEL);
+	if (!skbs)
+		return -ENOMEM;
+
+	data = kmalloc_array(size, sizeof(void *), GFP_KERNEL);
+	if (!data) {
+		kfree(skbs);
+		return -ENOMEM;
+	}
+
+	descs = kvcalloc(size, sizeof(struct xdp_desc), GFP_KERNEL);
+	if (!descs) {
+		kfree(data);
+		kfree(skbs);
+		return -ENOMEM;
+	}
+
+	xsk_batch_reset(batch, skbs, descs, data, size);
+
+	return 0;
+}
+
 static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			  sockptr_t optval, unsigned int optlen)
 {
@@ -1746,9 +1687,6 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	{
 		struct xsk_buff_pool *pool = xs->pool;
 		struct xsk_batch *batch = &xs->batch;
-		struct xdp_desc *descs;
-		struct sk_buff **skbs;
-		void **data;
 		unsigned int size;
 		int ret = 0;
 
@@ -1762,27 +1700,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			return -EACCES;
 
 		mutex_lock(&xs->mutex);
-		skbs = kmalloc(size * sizeof(struct sk_buff *), GFP_KERNEL);
-		if (!skbs) {
-			ret = -ENOMEM;
-			goto out;
-		}
-		data = kmalloc_array(size, sizeof(void *), GFP_KERNEL);
-		if (!data) {
-			kfree(skbs);
-			ret = -ENOMEM;
-			goto out;
-		}
-		descs = kvcalloc(size, sizeof(struct xdp_desc), GFP_KERNEL);
-		if (!descs) {
-			kfree(data);
-			kfree(skbs);
-			ret = -ENOMEM;
-			goto out;
-		}
-
-		xsk_batch_reset(batch, skbs, descs, data, size);
-out:
+		ret = xsk_init_batch(batch, size);
 		mutex_unlock(&xs->mutex);
 		return ret;
 	}
@@ -2056,6 +1974,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 {
 	struct xdp_sock *xs;
 	struct sock *sk;
+	int ret;
 
 	if (!ns_capable(net->user_ns, CAP_NET_RAW))
 		return -EPERM;
@@ -2071,6 +1990,11 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 	if (!sk)
 		return -ENOBUFS;
 
+	xs = xdp_sk(sk);
+	ret = xsk_init_batch(&xs->batch, 1);
+	if (ret)
+		return ret;
+
 	sock->ops = &xsk_proto_ops;
 
 	sock_init_data(sock, sk);
@@ -2081,7 +2005,6 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 
 	sock_set_flag(sk, SOCK_RCU_FREE);
 
-	xs = xdp_sk(sk);
 	xs->state = XSK_READY;
 	xs->max_tx_budget = TX_BATCH_SIZE;
 	mutex_init(&xs->mutex);
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC net-next v4 14/14] xsk: optimize xsk_build_skb for batch copy-mode fast path
  2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
                   ` (12 preceding siblings ...)
  2026-04-15  8:26 ` [PATCH RFC net-next v4 13/14] xsk: retire old xmit path in copy mode Jason Xing
@ 2026-04-15  8:26 ` Jason Xing
  2026-04-15  9:47   ` sashiko-bot
  13 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-15  8:26 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Three targeted optimizations for the batch copy-mode TX hot path:

Replace skb_store_bits() with memcpy() for single-buffer first-desc
path.  After skb_reserve() + skb_put(), the SKB is freshly allocated
with all data in the linear area and no frags, so skb_store_bits()
degenerates to memcpy(skb->data, buffer, len) but carries unnecessary
function call overhead, offset validation, and frag iteration logic.

Inline UMEM address computation in Phase 3 and pass the pre-computed
buffer pointer to xsk_build_skb(), avoiding the per-packet non-inlined
xp_raw_get_data() (EXPORT_SYMBOL) call chain:
xsk_buff_raw_get_data -> xp_raw_get_data -> __xp_raw_get_addr +
__xp_raw_get_data.
In the batch loop the pool->addrs and pool->unaligned are invariant,
so we cache them once and compute each buffer address inline.

Prefetch the *next* descriptor's UMEM data buffer at the top of the
Phase 3 loop, hiding the memory latency of the upcoming memcpy.

This improves performance by a stable 3-4%.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xdp_sock.h |  3 ++-
 net/core/skbuff.c      | 18 ++++++++++++++++--
 net/xdp/xsk.c          | 15 ++++++---------
 3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 0609e3b04279..5e05236c7fba 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -139,7 +139,8 @@ void __xsk_map_flush(struct list_head *flush_list);
 INDIRECT_CALLABLE_DECLARE(void xsk_destruct_skb(struct sk_buff *));
 struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			      struct sk_buff *allocated_skb,
-			      struct xdp_desc *desc);
+			      struct xdp_desc *desc,
+			      void *buffer);
 int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err);
 int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5726b1566b2b..bef5270e6332 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -752,14 +752,28 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
 	if (total_truesize)
 		refcount_add(total_truesize, &xs->sk.sk_wmem_alloc);
 
-	/* Phase 3: Build SKBs with packet data */
+	/* Phase 3: Build SKBs with packet data. */
+	struct xsk_buff_pool *pool = xs->pool;
+	void *pool_addrs = pool->addrs;
+	bool unaligned = pool->unaligned;
+
 	for (j = 0; j < alloc_descs; j++) {
+		u64 addr = descs[j].addr;
+		void *buffer;
+
+		if (unaligned)
+			addr = xp_unaligned_add_offset_to_addr(addr);
+		buffer = pool_addrs + addr;
+
+		if (j + 1 < alloc_descs)
+			prefetch(pool_addrs + descs[j + 1].addr);
+
 		if (!xs->skb) {
 			skb = skbs[skb_count - 1 - k];
 			k++;
 		}
 
-		skb = xsk_build_skb(xs, skb, &descs[j]);
+		skb = xsk_build_skb(xs, skb, &descs[j], buffer);
 		if (IS_ERR(skb)) {
 			*err = PTR_ERR(skb);
 			break;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index be341290e42c..3bf81b838075 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -811,7 +811,8 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 
 struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			      struct sk_buff *allocated_skb,
-			      struct xdp_desc *desc)
+			      struct xdp_desc *desc,
+			      void *buffer)
 {
 	struct net_device *dev = xs->dev;
 	struct sk_buff *skb = xs->skb;
@@ -825,11 +826,10 @@ struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			goto free_err;
 		}
 	} else {
-		u32 hr, tr, len;
-		void *buffer;
+		u32 hr, tr, len = desc->len;
 
-		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
-		len = desc->len;
+		if (!buffer)
+			buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
 
 		if (!skb) {
 			hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
@@ -844,10 +844,7 @@ struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 
 			skb_reserve(skb, hr);
 			skb_put(skb, len);
-
-			err = skb_store_bits(skb, 0, buffer, len);
-			if (unlikely(err))
-				goto free_err;
+			memcpy(skb->data, buffer, len);
 
 			xsk_skb_init_misc(skb, xs, desc->addr);
 			if (desc->options & XDP_TX_METADATA) {
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 01/14] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
  2026-04-15  8:26 ` [PATCH RFC net-next v4 01/14] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
@ 2026-04-15  8:51   ` sashiko-bot
  2026-04-15 12:46     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  8:51 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 1 potential issue(s):
- [High] Unsafe array allocation using kmalloc for a potentially large, user-controlled size
--

commit 16fdf26210767cd884154d2717d7894d96a2ee54
Author: Jason Xing <kernelxing@tencent.com>
xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt

This commit introduces the XDP_GENERIC_XMIT_BATCH setsockopt to allow
batch processing in copy mode, attempting to improve throughput by 
allocating and initializing multiple skbs at one time.

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 3b46bc635c432..1194268ae8676 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
[ ... ]
> @@ -1582,6 +1593,42 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		WRITE_ONCE(xs->max_tx_budget, budget);
>  		return 0;
>  	}
> +	case XDP_GENERIC_XMIT_BATCH:
> +	{
> +		struct xsk_buff_pool *pool = xs->pool;
> +		struct xsk_batch *batch = &xs->batch;
> +		struct xdp_desc *descs;
> +		struct sk_buff **skbs;
> +		unsigned int size;
> +		int ret = 0;
> +
> +		if (optlen != sizeof(size))
> +			return -EINVAL;
> +		if (copy_from_sockptr(&size, optval, sizeof(size)))
> +			return -EFAULT;
> +		if (size == batch->generic_xmit_batch)
> +			return 0;
> +		if (!size || size > xs->max_tx_budget || !pool)
> +			return -EACCES;
> +
> +		mutex_lock(&xs->mutex);
> +		skbs = kmalloc(size * sizeof(struct sk_buff *), GFP_KERNEL);

Can the requested size here become large enough to trigger high-order
allocation failures under memory fragmentation?

Since size is bounded by xs->max_tx_budget, which follows the TX ring
entries, it looks like this could reach up to 65536 on some configurations.
A 512KB contiguous allocation via kmalloc might fail on long-running
systems and trigger page allocation warnings in the kernel log.

Also, since size is a user-provided value bounded only by the maximum
budget, could the open-coded multiplication overflow on 32-bit platforms?

Would it be safer to use kvmalloc_array() here to prevent both integer
overflows and the dependency on physically contiguous memory, similar to
how kvcalloc() is used for the descs array right below it?

> +		if (!skbs) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		descs = kvcalloc(size, sizeof(struct xdp_desc), GFP_KERNEL);
[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=1

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 02/14] xsk: extend xsk_build_skb() to support passing an already allocated skb
  2026-04-15  8:26 ` [PATCH RFC net-next v4 02/14] xsk: extend xsk_build_skb() to support passing an already allocated skb Jason Xing
@ 2026-04-15  8:52   ` sashiko-bot
  2026-04-15 13:19     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  8:52 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 1 potential issue(s):
- [High] Inconsistent ownership semantics of `allocated_skb` on error paths lead to a memory leak in the zero-copy path.
--

commit a6016c1c59927d7e151f38b7f237ad699db86b1e
Author: Jason Xing <kernelxing@tencent.com>

xsk: extend xsk_build_skb() to support passing an already allocated skb

This patch extends xsk_build_skb() to accept an allocated_skb parameter,
allowing the batch feature to reuse an already allocated skb instead of
allocating a new one.

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 1194268ae8676..becc8ebfd82f8 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c

[ ... ]

> @@ -793,15 +797,16 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
>  	return skb;
>  }
>  
> -static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> -				     struct xdp_desc *desc)
> +struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> +			      struct sk_buff *allocated_skb,
> +			      struct xdp_desc *desc)
>  {
>  	struct net_device *dev = xs->dev;
>  	struct sk_buff *skb = xs->skb;
>  	int err;
>  
>  	if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> -		skb = xsk_build_skb_zerocopy(xs, desc);
> +		skb = xsk_build_skb_zerocopy(xs, allocated_skb, desc);
>  		if (IS_ERR(skb)) {
>  			err = PTR_ERR(skb);
>  			skb = NULL;
>  			goto free_err;

Does this leak allocated_skb on the zero-copy error path?

If xsk_build_skb_zerocopy() fails on the first descriptor (e.g., if
xsk_skb_metadata() returns an error due to invalid metadata), it returns an
ERR_PTR without freeing the SKB.

Because skb is then explicitly set to NULL before jumping to free_err,
kfree_skb() is skipped.

While the original code also had a leak here for internally allocated SKBs,
this patch extends the issue to caller-provided SKBs. In copy mode,
xsk_build_skb() correctly frees allocated_skb on error, but the zero-copy
path does not.

Could this lead to memory exhaustion if a caller assumes xsk_build_skb()
always consumes or frees the provided SKB on error?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=2

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 05/14] xsk: add direct xmit in batch function
  2026-04-15  8:26 ` [PATCH RFC net-next v4 05/14] xsk: add direct xmit in batch function Jason Xing
@ 2026-04-15  9:11   ` sashiko-bot
  2026-04-16  3:04     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  9:11 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 4 potential issue(s):
- [High] Packet validation executed inside hardware transmit lock, causing severe lock contention
- [High] Missing transmit recursion protection introduces deadlock vectors
- [High] Missing device state validation allows transmission to stopped or suspended devices
- [Medium] Hardware doorbell batching is defeated by hardcoded 'more' parameter
--

commit b98984346f64d24e3ecd521b219464ccbe9fba91
Author: Jason Xing <kernelxing@tencent.com>

xsk: add direct xmit in batch function

This commit adds a batch transmission logic for AF_XDP. It introduces a
new direct transmit function that disables bottom halves, grabs the lock
once, and sends all aggregated packets in a loop.

> diff --git a/net/core/dev.c b/net/core/dev.c
> --- a/net/core/dev.c
> +++ b/net/core/dev.c

[ ... ]

> +int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
> +{
> +	u16 queue_id = xs->queue_id;
> +	struct netdev_queue *txq = netdev_get_tx_queue(dev, queue_id);
> +	struct sk_buff_head *send_queue = &xs->batch.send_queue;
> +	int ret = NETDEV_TX_BUSY;
> +	struct sk_buff *skb;
> +

Should there be a check for the device state before attempting to transmit?
Direct transmit paths that bypass the Qdisc layer, like __dev_direct_xmit(),
typically verify that the device is running and the carrier is up:

	if (unlikely(!netif_running(dev) ||
		     !netif_carrier_ok(dev)))
		goto drop;

Could invoking transmission on a stopped or suspended device cause issues?

> +	local_bh_disable();

Are we missing transmit recursion protection here? Standard paths wrap the
hardware transmit lock with recursion tracking to protect against stack
overflows and deadlocks when routing through virtual devices:

	dev_xmit_recursion_inc();
	HARD_TX_LOCK(dev, txq, smp_processor_id());

> +	HARD_TX_LOCK(dev, txq, smp_processor_id());
> +	while ((skb = __skb_dequeue(send_queue)) != NULL) {
> +		struct sk_buff *orig_skb = skb;
> +		bool again = false;
> +
> +		skb = validate_xmit_skb_list(skb, dev, &again);

Will executing validate_xmit_skb_list() inside the hardware transmit lock
cause lock contention? This function can perform computationally expensive
operations such as GSO segmentation, software checksum fallbacks, and
subsystem hooks. Other transmit paths execute this validation strictly
before acquiring the hardware lock.

> +		if (skb != orig_skb) {
> +			dev_core_stats_tx_dropped_inc(dev);
> +			kfree_skb_list(skb);
> +			ret = NET_XMIT_DROP;
> +			break;
> +		}
> +
> +		if (netif_xmit_frozen_or_drv_stopped(txq)) {
> +			__skb_queue_head(send_queue, skb);
> +			break;
> +		}
> +		skb_set_queue_mapping(skb, queue_id);
> +		ret = netdev_start_xmit(skb, dev, txq, false);

Does passing false for the more parameter defeat the purpose of hardware
doorbell batching? It instructs the driver to ring the hardware doorbell for
every single packet instead of delaying it until the end of the batch.

It looks like this might be fixed later in the patch series by commit
6d7c8f2ad056e ("xsk: support dynamic xmit.more control for batch xmit").

> +		if (ret != NETDEV_TX_OK) {
> +			if (ret == NETDEV_TX_BUSY)
> +				__skb_queue_head(send_queue, skb);
> +			break;
> +		}
> +	}
> +	HARD_TX_UNLOCK(dev, txq);

Also, we would need to pair any added recursion increment with a decrement
here:

	dev_xmit_recursion_dec();

> +	local_bh_enable();
> +
> +	return ret;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=5

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 03/14] xsk: add xsk_alloc_batch_skb() to build skbs in batch
  2026-04-15  8:26 ` [PATCH RFC net-next v4 03/14] xsk: add xsk_alloc_batch_skb() to build skbs in batch Jason Xing
@ 2026-04-15  9:17   ` sashiko-bot
  2026-04-16  1:18     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  9:17 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 2 potential issue(s):
- [High] Pre-allocated SKB and socket memory charge (`sk_wmem_alloc`) are permanently leaked when zerocopy metadata processing fails.
- [High] Double-free and Use-After-Free of `xs->skb` when fragment allocation fails with `-EAGAIN` in copy mode.
--

commit 3ef2c03b6204759327ab337fe34617d9dd8e04ef
Author: Jason Xing <kernelxing@tencent.com>

xsk: add xsk_alloc_batch_skb() to build skbs in batch

This commit introduces xsk_alloc_batch_skb() to support allocating and
building skbs in batch. It reserves skbs, batches the sk_wmem_alloc charge,
and then copies data to complete initialization.

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c

[ ... ]

> +	/* Phase 3: Build SKBs with packet data */
> +	for (j = 0; j < alloc_descs; j++) {
> +		if (!xs->skb) {
> +			skb = skbs[skb_count - 1 - k];
> +			k++;
> +		}
> +
> +		skb = xsk_build_skb(xs, skb, &descs[j]);
> +		if (IS_ERR(skb)) {
> +			*err = PTR_ERR(skb);
> +			break;
> +		}

Can this lead to a double-free and use-after-free of xs->skb if a fragment
allocation fails?

If xsk_build_skb() fails to allocate a page for a subsequent fragment in copy
mode, it frees the partially built skb and returns -EAGAIN. However, it
leaves xs->skb pointing to this freed memory.

When this loop breaks and returns -EAGAIN up the call chain, the caller
__xsk_generic_xmit_batch() checks xs->skb and calls xsk_drop_skb(xs->skb),
which would double-free the skb. Furthermore, any subsequent transmissions
could dereference the dangling pointer. Should xs->skb be explicitly cleared
when handling this error?

> +
> +		if (xp_mb_desc(&descs[j])) {
> +			xs->skb = skb;
> +			continue;
> +		}
> +
> +		xs->skb = NULL;
> +		__skb_queue_tail(&batch->send_queue, skb);
> +	}
> +
> +	/* Phase 4: Reclaim unused allocated SKBs */
> +	while (k < i)
> +		kfree_skb(skbs[skb_count - 1 - k++]);

Does this permanently leak the pre-allocated skb and its socket memory charge
if xsk_build_skb() fails?

When xsk_build_skb() encounters an error like a zerocopy metadata processing
failure, it returns an ERR_PTR() without freeing the skb. Because k was
already incremented before the xsk_build_skb() call, the Phase 4 loop starts
reclaiming at the next skb.

The failed skb at index k - 1 is bypassed. Since it is never freed,
sock_wfree is skipped, leaking both the skb and the batched sk_wmem_alloc
charge.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 13/14] xsk: retire old xmit path in copy mode
  2026-04-15  8:26 ` [PATCH RFC net-next v4 13/14] xsk: retire old xmit path in copy mode Jason Xing
@ 2026-04-15  9:18   ` sashiko-bot
  2026-04-16 10:33     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  9:18 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 3 potential issue(s):
- [Critical] Time-of-Check to Time-of-Use (TOCTOU) Use-After-Free (UAF) vulnerability on `batch->desc_cache`.
- [High] Leaked socket structure on error path in `xsk_create()`.
- [High] Unbounded `kmalloc` size for user-controlled batch size triggers `WARN_ON` and potential panic.
--

commit 5f2cebbff2efe715b882067f337741749751ac65
Author: Jason Xing <kernelxing@tencent.com>

xsk: retire old xmit path in copy mode

This commit adds a new helper, xsk_init_batch(), with a default batch size
of 1, and retires __xsk_generic_xmit() by replacing it with
__xsk_generic_xmit_batch() in the generic transmit path.

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 16e47f6fc30ec..232e7bf96b342 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
[ ... ]
> @@ -1030,101 +1030,14 @@ static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
>  	return err;
>  }
>  
[ ... ]
>  static int xsk_generic_xmit(struct sock *sk)
>  {
> +	struct xdp_sock *xs = xdp_sk(sk);
>  	int ret;
>  
>  	/* Drop the RCU lock since the SKB path might sleep. */
>  	rcu_read_unlock();
> -	ret = __xsk_generic_xmit(sk);
> +	ret = __xsk_generic_xmit_batch(xs);

By switching to __xsk_generic_xmit_batch(), does this expose a
use-after-free race with xsk_setsockopt()?

In __xsk_generic_xmit_batch(), the local pointer descs is assigned
batch->desc_cache before acquiring xs->mutex.

Concurrently, a user can call setsockopt(..., XDP_GENERIC_XMIT_BATCH),
which acquires xs->mutex and updates the batch size via xsk_init_batch().
This safely frees the old desc_cache array using kvfree() in
xsk_batch_reset() because the setsockopt path does not check for
xs->state != XSK_READY.

Once setsockopt drops the mutex, the blocked __xsk_generic_xmit_batch()
thread acquires the mutex and proceeds to use its stale descs pointer.
Can this lead to a use-after-free write into the freed descriptor array?

>  	/* Reacquire RCU lock before going into common code. */
>  	rcu_read_lock();
[ ... ]
> @@ -1620,6 +1533,34 @@ struct xdp_umem_reg_v1 {
>  	__u32 headroom;
>  };
>  
> +static int xsk_init_batch(struct xsk_batch *batch, unsigned int size)
> +{
> +	struct xdp_desc *descs;
> +	struct sk_buff **skbs;
> +	void **data;
> +
> +	skbs = kmalloc(size * sizeof(struct sk_buff *), GFP_KERNEL);

Can this trigger a WARN_ON_ONCE in the page allocator if a user requests
a very large batch size?

The size parameter is controlled by userspace via the XDP_GENERIC_XMIT_BATCH
setsockopt and is bounded by the TX ring size (xs->max_tx_budget), which
can be quite large.

Since these pointer arrays can exceed KMALLOC_MAX_SIZE, should this be
allocated using kvmalloc_array() similar to how the descs array is
allocated below?

> +	if (!skbs)
> +		return -ENOMEM;
> +
> +	data = kmalloc_array(size, sizeof(void *), GFP_KERNEL);
> +	if (!data) {
> +		kfree(skbs);
> +		return -ENOMEM;
> +	}
> +
> +	descs = kvcalloc(size, sizeof(struct xdp_desc), GFP_KERNEL);
[ ... ]
> @@ -2065,6 +1984,11 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
>  	if (!sk)
>  		return -ENOBUFS;
>  
> +	xs = xdp_sk(sk);
> +	ret = xsk_init_batch(&xs->batch, 1);
> +	if (ret)
> +		return ret;
> +

Does this leak the sk object if xsk_init_batch() fails?

If xsk_init_batch() returns an error here, the function returns early
without calling sk_free(sk).

Because sock_init_data() has not been executed yet to link the sk to the
sock structure, the VFS socket layer will not free the socket memory when
tearing down the failed creation. Should this path clean up the allocated
socket before returning?

>  	sock->ops = &xsk_proto_ops;
>  
>  	sock_init_data(sock, sk);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=13

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 12/14] xsk: separate read-mostly and write-heavy fields in xsk_buff_pool
  2026-04-15  8:26 ` [PATCH RFC net-next v4 12/14] xsk: separate read-mostly and write-heavy fields in xsk_buff_pool Jason Xing
@ 2026-04-15  9:20   ` sashiko-bot
  2026-04-16 10:09     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  9:20 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 1 potential issue(s):
- [High] The patch groups NAPI RX write-heavy fields and NAPI TX write-heavy fields into the exact same cacheline, introducing severe false sharing between RX and TX CPUs.
--

commit cf39ff4311089098b2e57c37ba6e85e7a36ac4a8
Author: Jason Xing <kernelxing@tencent.com>

xsk: separate read-mostly and write-heavy fields in xsk_buff_pool

This commit separates read-mostly pool geometry fields from write-heavy
fields in struct xsk_buff_pool by adding cacheline alignment. This is
intended to eliminate false sharing between the sendmsg CPU and the NAPI
TX-completion CPU.

> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index 0b1abdb99c9ea..0376bd4cc95b0 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h

[ ... ]

> @@ -73,23 +73,27 @@ struct xsk_buff_pool {

[ ... ]

>  	void *addrs;
> +
> +	/* Write-heavy fields */
>  	/* Mutual exclusion of the completion ring in the SKB mode.
>  	 * Protect: NAPI TX thread and sendmsg error paths in the SKB
>  	 * destructor callback.
>  	 */
> -	spinlock_t cq_prod_lock;
> +	spinlock_t cq_prod_lock ____cacheline_aligned_in_smp;
> +	u8 cached_need_wakeup;
> +	u32 free_heads_cnt;
>  	struct xdp_buff_xsk *free_heads[];
>  };

Could placing these four fields together in the same aligned cacheline
introduce new false sharing between the NAPI RX and TX CPUs?

The fields cq_prod_lock and cached_need_wakeup are updated by the TX
completion path.

However, free_heads_cnt and the free_heads flexible array are heavily
mutated by the RX path during buffer allocation and release. 

Because cq_prod_lock (spinlock_t) and cached_need_wakeup (u8) only consume
a fraction of the 64-byte cacheline, free_heads_cnt and the first several
elements of the free_heads array will pack into this exact same cacheline.

If a deployment pins RX and TX interrupts to different CPU cores, it looks
like this might cause continuous cacheline bouncing between the RX CPU and
the TX CPU for every packet processed.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=12

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 07/14] xsk: try to skip validating skb list in xmit path
  2026-04-15  8:26 ` [PATCH RFC net-next v4 07/14] xsk: try to skip validating skb list in xmit path Jason Xing
@ 2026-04-15  9:33   ` sashiko-bot
  2026-04-16  5:55     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  9:33 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 2 potential issue(s):
- [Critical] Bypassing `validate_xmit_skb_list()` based on `NETIF_F_SG` skips critical hardware capability checks, software fallbacks, and SKB initialization, leading to kernel panics and corrupted packets.
- [High] Breaking out of the transmit loop early leaves packets stuck in the TX ring indefinitely because the hardware doorbell is never flushed.
--

commit 7603ac608a36b8d39e026a60b567ea2b106617f0
Author: Jason Xing <kernelxing@tencent.com>

xsk: try to skip validating skb list in xmit path

This patch attempts to optimize the XSK transmit path by skipping
validate_xmit_skb_list() if the device supports Scatter-Gather,
aiming to improve performance for drivers like ixgbe.

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 99188353abec6..12643cbab6530 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
[ ... ]
> @@ -4878,15 +4879,17 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
>  	local_bh_disable();
>  	HARD_TX_LOCK(dev, txq, smp_processor_id());
>  	while ((skb = __skb_dequeue(send_queue)) != NULL) {
> -		struct sk_buff *orig_skb = skb;
> -		bool again = false;
> -
> -		skb = validate_xmit_skb_list(skb, dev, &again);
> -		if (skb != orig_skb) {
> -			dev_core_stats_tx_dropped_inc(dev);
> -			kfree_skb_list(skb);
> -			ret = NET_XMIT_DROP;
> -			break;
> +		if (unlikely(need_validate)) {
> +			struct sk_buff *orig_skb = skb;
> +			bool again = false;
> +
> +			skb = validate_xmit_skb_list(skb, dev, &again);

Does skipping validate_xmit_skb_list() based on NETIF_F_SG bypass necessary
hardware capability checks and software fallbacks?

validate_xmit_skb() performs dynamic feature verification via
ndo_features_check(), transport header initialization via
skb_set_transport_header(), and software checksum fallbacks via
skb_csum_hwoffload_help().

AF_XDP users can request TX checksum offload, setting skb->ip_summed to
CHECKSUM_PARTIAL. By skipping validation solely because the device supports
Scatter-Gather, packets bypass skb_set_transport_header(), leaving
skb->transport_header uninitialized.

Could this cause drivers that use skb_transport_offset() to compute massive
sizes, corrupting TX descriptors and potentially hanging the NIC?

Additionally, if the device lacks NETIF_F_HW_CSUM, skipping
skb_csum_hwoffload_help() sends CHECKSUM_PARTIAL to drivers unable to handle
it. Might this result in silent data corruption on the wire or driver crashes?

> +			if (skb != orig_skb) {
> +				dev_core_stats_tx_dropped_inc(dev);
> +				kfree_skb_list(skb);
> +				ret = NET_XMIT_DROP;
> +				break;

If this loop breaks early, does it leave packets stuck in the TX ring?

While this behavior was introduced earlier in the series and not by this
specific patch, the issue is present in this modified block.

The batching logic calls netdev_start_xmit() with more = true for all packets
except the last one in the queue.

If the loop breaks early here due to a validate_xmit_skb_list() failure,
the function returns without ever calling netdev_start_xmit() with
more = false.

Will this cause the driver to defer ringing the hardware TX doorbell,
leaving previously submitted packets stranded in the TX ring and causing
latency spikes or TX watchdog timeouts?

> +			}
>  		}
>  
>  		if (netif_xmit_frozen_or_drv_stopped(txq)) {

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=7

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 06/14] xsk: support dynamic xmit.more control for batch xmit
  2026-04-15  8:26 ` [PATCH RFC net-next v4 06/14] xsk: support dynamic xmit.more control for batch xmit Jason Xing
@ 2026-04-15  9:35   ` sashiko-bot
  2026-04-16  3:43     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  9:35 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 1 potential issue(s):
- [High] Stalled TX queue due to missing hardware doorbell ring on early loop exit
--

commit 6d7c8f2ad056eb4cb2eec8d360e632f9ef9044f5
Author: Jason Xing <kernelxing@tencent.com>

xsk: support dynamic xmit.more control for batch xmit

This commit updates xsk batch transmission to only set xmit.more to false for
the last skb in the send queue, aiming to reduce unnecessary IRQs and improve
batching performance.

> diff --git a/net/core/dev.c b/net/core/dev.c
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4873,6 +4873,7 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
>  	struct sk_buff_head *send_queue = &xs->batch.send_queue;
>  	int ret = NETDEV_TX_BUSY;
>  	struct sk_buff *skb;
> +	bool more = true;
>  
>  	local_bh_disable();
>  	HARD_TX_LOCK(dev, txq, smp_processor_id());

[ ... ]

> @@ -4892,8 +4893,12 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
>  			__skb_queue_head(send_queue, skb);
>  			break;
>  		}
> +
> +		if (!skb_peek(send_queue))
> +			more = false;
> +
>  		skb_set_queue_mapping(skb, queue_id);
> -		ret = netdev_start_xmit(skb, dev, txq, false);
> +		ret = netdev_start_xmit(skb, dev, txq, more);
>  		if (ret != NETDEV_TX_OK) {
>  			if (ret == NETDEV_TX_BUSY)
>  				__skb_queue_head(send_queue, skb);
>  			break;
>  		}
>  	}

If an error occurs and the loop breaks early, does this leave previously
transmitted packets stalled in the driver's TX ring?

Since more is passed as true for all packets prior to the failure, the
network driver defers ringing the hardware TX doorbell.

If the code breaks out of the loop due to validate_xmit_skb_list() failing,
netif_xmit_frozen_or_drv_stopped(), or netdev_start_xmit() returning
NETDEV_TX_BUSY, the loop terminates without ever passing more = false to the
driver.

Could this omission prevent the hardware from being notified to transmit
the successfully processed packets that preceded the error?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=6

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 11/14] xsk: support batch xmit main logic
  2026-04-15  8:26 ` [PATCH RFC net-next v4 11/14] xsk: support batch xmit main logic Jason Xing
@ 2026-04-15  9:38   ` sashiko-bot
  2026-04-16  9:58     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  9:38 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 5 potential issue(s):
- [Critical] TOCTOU race on socket state pointers leading to NULL pointer dereference and Use-After-Free.
- [Critical] Integer underflow in `xsk_alloc_batch_skb()` leading to massive OOB memory allocation and kernel crash.
- [Critical] Unconditional dropping of partially built packets (`xs->skb`) causes buffer leaks and packet stream corruption.
- [High] Double cancellation of CQ slots and incorrect TX pointer manipulation on SKB build failure.
- [High] Infinite `-EAGAIN` livelock and stat inflation when packets span batch boundaries.
--

commit 4041fe3cb310f6f01b04ae2ff1fbd04cf0aa2713
Author: Jason Xing <kernelxing@tencent.com>
xsk: support batch xmit main logic

This commit implements the core logic for batched transmission in AF_XDP
sockets via __xsk_generic_xmit_batch(). It introduces a loop to read
descriptors, reserve completion queue slots, allocate SKBs, and transmit
packets in batches.

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index b4f3cc7125804..16e47f6fc30ec 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
[ ... ]
> @@ -914,6 +914,122 @@ struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  	return ERR_PTR(err);
>  }
>  
> +static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
> +{
> +	struct xsk_buff_pool *pool = xs->pool;
> +	struct xsk_batch *batch = &xs->batch;
> +	struct xdp_desc *descs = batch->desc_cache;
> +	struct net_device *dev = xs->dev;
> +	u32 max_batch, max_budget;
> +	bool sent_frame = false;
> +	struct sk_buff *skb;
> +	u32 cons_descs;
> +	int err = 0;
> +	u32 i = 0;
> +
> +	mutex_lock(&xs->mutex);

Can a TOCTOU race occur here with a concurrent socket bind() or setsockopt()?

The pointers for pool, dev, and batch->desc_cache are read before the mutex
is acquired. If another thread modifies the socket state, could these
pointers be NULL or freed by the time xsk_is_bound() is checked, leading to
a use-after-free or NULL pointer dereference when accessing dev or descs?

> +
> +	/* Since we dropped the RCU read lock, the socket state might have changed. */
> +	if (unlikely(!xsk_is_bound(xs))) {
[ ... ]
> +	for (i = 0; i < max_budget; i += cons_descs) {
> +		u32 nb_pkts = 0;
> +		u32 nb_descs;
[ ... ]
> +		cons_descs = xskq_cons_read_desc_batch_copy(xs->tx, pool, descs,
> +							    nb_descs, &nb_pkts);
> +		if (cons_descs < nb_descs) {
> +			u32 delta = nb_descs - cons_descs;
> +
> +			xsk_cq_cancel_locked(pool, delta);
> +			xs->tx->queue_empty_descs += delta;
> +			if (!cons_descs) {
> +				err = -EAGAIN;
> +				goto out;
> +			}

Could this lead to an infinite -EAGAIN livelock and inflated statistics?

If a multi-fragment packet spans beyond the nb_descs batch limit without an
end-of-packet descriptor, xskq_cons_read_desc_batch_copy() might safely
rewind the fragments and return 0. The code then treats these valid
descriptors as empty, adds to queue_empty_descs, and returns -EAGAIN,
causing the application to loop infinitely hitting the same boundary.

> +			nb_descs = cons_descs;
> +		}
> +
> +		cons_descs = xsk_alloc_batch_skb(xs, nb_pkts, nb_descs, &err);

Does this calculation of nb_pkts trigger an integer underflow?

If a batch contains only fragments without an end-of-packet flag, nb_pkts
will be 0. Inside xsk_alloc_batch_skb(), if there is an existing partially
built skb (xs->skb != NULL), the code may decrement nb_pkts. Since nb_pkts
is unsigned, nb_pkts-- would underflow to 0xFFFFFFFF, resulting in a
massive out-of-bounds memory allocation.

> +		/* Return 'nb_descs - cons_descs' number of descs to the
> +		 * pool if the batch allocation partially fails
> +		 */
> +		if (cons_descs < nb_descs) {
> +			xskq_cons_cancel_n(xs->tx, nb_descs - cons_descs);
> +			xsk_cq_cancel_locked(pool, nb_descs - cons_descs);
> +		}

Can this cause a double cancellation of completion queue slots?

If xsk_build_skb() fails inside xsk_alloc_batch_skb() due to memory limits,
it already calls xsk_cq_cancel_locked() internally:

net/xdp/xsk.c:xsk_build_skb() {
	...
	if (err == -EOVERFLOW) {
		...
	} else {
		xsk_cq_cancel_locked(xs->pool, 1);
	}
}

Calling xsk_cq_cancel_locked() again here for the failed descriptors would
underflow the completion queue pointers, potentially corrupting the user
space ring buffer. 

Also, does xsk_build_skb() advancing the TX consumer pointer on -EOVERFLOW
conflict with the rewind happening here?

[ ... ]
> +		if (err)
> +			goto out;
> +	}
> +
> +	/* Maximum budget of descriptors have been consumed */
> +	if (xskq_has_descs(xs->tx))
> +		err = -EAGAIN;
> +
> +out:
> +	if (xs->skb)
> +		xsk_drop_skb(xs->skb);

Does unconditionally dropping the partially built xs->skb corrupt the packet
stream? 

If transmission hits a NETDEV_TX_BUSY (-EAGAIN) or a valid multi-fragment
packet spans a batch boundary, the code drops the fragments processed so far. 
Because the completion queue slots and TX pointers for these early fragments
might not be properly rewound or notified, does this permanently leak buffers
and merge unrelated fragments into corrupted packets on the next sendmsg()
retry?

> +
> +	/* If send_queue has more pending skbs, we must to clear
> +	 * the rest of them.
> +	 */
> +	while ((skb = __skb_dequeue(&batch->send_queue)) != NULL) {

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=11

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 04/14] xsk: cache data buffers to avoid frequently calling kmalloc_reserve
  2026-04-15  8:26 ` [PATCH RFC net-next v4 04/14] xsk: cache data buffers to avoid frequently calling kmalloc_reserve Jason Xing
@ 2026-04-15  9:38   ` sashiko-bot
  2026-04-16  2:45     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  9:38 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 2 potential issue(s):
- [High] Cached data buffers are consumed without validating current allocation flags, potentially violating memory placement constraints.
- [Medium] Meaningful code and commit message mismatch regarding cache refill logic.
--

commit 5a442e9c44f0a64a147908843084cffe5e2fed9f
Author: Jason Xing <kernelxing@tencent.com>

xsk: cache data buffers to avoid frequently calling kmalloc_reserve

This commit introduces a per-socket data buffer cache for small packets
to reduce kmalloc_reserve() overhead. It adds bulk allocation from
skb_small_head_cache and persists unused buffers across batch cycles.

>     Inside the Phase-1 per-descriptor loop, when a small packet needs a
>     data buffer and the cache is empty, a single kmem_cache_alloc_bulk()
>     refills it with generic_xmit_batch objects. Subsequent small packets
>     pop directly from the cache. Large packets bypass the cache entirely
>     and fall back to kmalloc_reserve(). Unused buffers remain in the
>     cache for the next batch.

Is this description accurate? Looking at the implementation below, the bulk
allocation happens outside the Phase-1 loop. If the cache runs empty during
the loop, it appears to fall back to kmalloc_reserve() for each subsequent
packet rather than refilling the cache via kmem_cache_alloc_bulk(). Could
the commit message be updated to match the actual execution flow?

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index c7e42c2966cdd..3b3969adfd23f 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
[ ... ]
> @@ -709,10 +718,17 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
>  
>  			skb = skbs[skb_count - 1 - i];
>  			skbuff_clear(skb);
> -			data = kmalloc_reserve(&size, gfp_mask, node, skb);
> -			if (unlikely(!data)) {
> -				*err = -ENOBUFS;
> -				break;
> +			if (dc_count &&
> +			    SKB_HEAD_ALIGN(size) <= SKB_SMALL_HEAD_CACHE_SIZE) {

If the socket's sk_allocation (which populates gfp_mask) was modified to
require restricted memory placement (e.g. __GFP_DMA), could this code use a
standard memory buffer left over from a previous batch?

While the bulk allocation check before the loop correctly avoids allocating
from skb_small_head_cache if KMALLOC_NOT_NORMAL_BITS is set, the
consumption path here only checks dc_count. Does this need to validate
gfp_mask against KMALLOC_NOT_NORMAL_BITS before consuming from the cache,
to ensure memory placement constraints are not bypassed?

> +				data = dc[--dc_count];
> +				size = SKB_SMALL_HEAD_CACHE_SIZE;
> +			} else {
> +				data = kmalloc_reserve(&size, gfp_mask,
> +						       node, skb);
> +				if (unlikely(!data)) {
> +					*err = -ENOBUFS;
> +					break;
> +				}
>  			}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=4

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 14/14] xsk: optimize xsk_build_skb for batch copy-mode fast path
  2026-04-15  8:26 ` [PATCH RFC net-next v4 14/14] xsk: optimize xsk_build_skb for batch copy-mode fast path Jason Xing
@ 2026-04-15  9:47   ` sashiko-bot
  2026-04-16 13:12     ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: sashiko-bot @ 2026-04-15  9:47 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 2 potential issue(s):
- [High] An SKB memory leak occurs when xsk_build_skb() returns an error for a newly assigned SKB.
- [Medium] The prefetch instruction uses the raw AF_XDP descriptor address without extracting the offset, creating an invalid pointer in unaligned mode.
--

commit 7bdeef673e571c57c404cb8d05446f46b3e2676e
Author: Jason Xing <kernelxing@tencent.com>

xsk: optimize xsk_build_skb for batch copy-mode fast path

This commit introduces three targeted optimizations for the batch
copy-mode TX hot path in XDP sockets, including replacing skb_store_bits()
with memcpy(), inlining UMEM address computation, and prefetching the
next descriptor's UMEM data.

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3b3969adfd23f..0fccb0c8cbe90 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -751,14 +751,28 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
>  	if (total_truesize)
>  		refcount_add(total_truesize, &xs->sk.sk_wmem_alloc);
>  
> -	/* Phase 3: Build SKBs with packet data */
> +	/* Phase 3: Build SKBs with packet data. */
> +	struct xsk_buff_pool *pool = xs->pool;
> +	void *pool_addrs = pool->addrs;
> +	bool unaligned = pool->unaligned;
> +
>  	for (j = 0; j < alloc_descs; j++) {
> +		u64 addr = descs[j].addr;
> +		void *buffer;
> +
> +		if (unaligned)
> +			addr = xp_unaligned_add_offset_to_addr(addr);
> +		buffer = pool_addrs + addr;
> +
> +		if (j + 1 < alloc_descs)
> +			prefetch(pool_addrs + descs[j + 1].addr);

Does this prefetch use the raw AF_XDP descriptor address without extracting
the offset?

In unaligned mode, an AF_XDP descriptor addr field encodes a chunk offset
in its upper 16 bits. While xp_unaligned_add_offset_to_addr() is used to
decode the current descriptor address, the prefetch directly adds
descs[j + 1].addr to pool_addrs.

Could this result in an out-of-bounds pointer in unaligned mode, failing to
hide memory latency and causing cache pollution?

> +
>  		if (!xs->skb) {
>  			skb = skbs[skb_count - 1 - k];
>  			k++;
>  		}
>  
> -		skb = xsk_build_skb(xs, skb, &descs[j]);
> +		skb = xsk_build_skb(xs, skb, &descs[j], buffer);
>  		if (IS_ERR(skb)) {
>  			*err = PTR_ERR(skb);
>  			break;

Can this leak the newly assigned SKB if xsk_build_skb() returns an error?

If xsk_build_skb() fails, it returns ERR_PTR(err) and the loop breaks. Since
k was already incremented, the cleanup loop in Phase 4 (while (k < i)
kfree_skb(...)) reclaims unused SKBs starting from the current k, skipping
the failed SKB at index k - 1.

Could a malicious userspace program exploit this to systematically leak SKBs
by triggering metadata errors?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=14

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 01/14] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
  2026-04-15  8:51   ` sashiko-bot
@ 2026-04-15 12:46     ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-15 12:46 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 4:51 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [High] Unsafe array allocation using kmalloc for a potentially large, user-controlled size
> --
>
> commit 16fdf26210767cd884154d2717d7894d96a2ee54
> Author: Jason Xing <kernelxing@tencent.com>
> xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
>
> This commit introduces the XDP_GENERIC_XMIT_BATCH setsockopt to allow
> batch processing in copy mode, attempting to improve throughput by
> allocating and initializing multiple skbs at one time.
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 3b46bc635c432..1194268ae8676 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> [ ... ]
> > @@ -1582,6 +1593,42 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> >               WRITE_ONCE(xs->max_tx_budget, budget);
> >               return 0;
> >       }
> > +     case XDP_GENERIC_XMIT_BATCH:
> > +     {
> > +             struct xsk_buff_pool *pool = xs->pool;
> > +             struct xsk_batch *batch = &xs->batch;
> > +             struct xdp_desc *descs;
> > +             struct sk_buff **skbs;
> > +             unsigned int size;
> > +             int ret = 0;
> > +
> > +             if (optlen != sizeof(size))
> > +                     return -EINVAL;
> > +             if (copy_from_sockptr(&size, optval, sizeof(size)))
> > +                     return -EFAULT;
> > +             if (size == batch->generic_xmit_batch)
> > +                     return 0;
> > +             if (!size || size > xs->max_tx_budget || !pool)
> > +                     return -EACCES;
> > +
> > +             mutex_lock(&xs->mutex);
> > +             skbs = kmalloc(size * sizeof(struct sk_buff *), GFP_KERNEL);
>
> Can the requested size here become large enough to trigger high-order
> allocation failures under memory fragmentation?
>
> Since size is bounded by xs->max_tx_budget, which follows the TX ring
> entries, it looks like this could reach up to 65536 on some configurations.
> A 512KB contiguous allocation via kmalloc might fail on long-running
> systems and trigger page allocation warnings in the kernel log.
>
> Also, since size is a user-provided value bounded only by the maximum
> budget, could the open-coded multiplication overflow on 32-bit platforms?
>
> Would it be safer to use kvmalloc_array() here to prevent both integer
> overflows and the dependency on physically contiguous memory, similar to
> how kvcalloc() is used for the descs array right below it?

kvmalloc_array() indeed makes more sense. Will update it.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 02/14] xsk: extend xsk_build_skb() to support passing an already allocated skb
  2026-04-15  8:52   ` sashiko-bot
@ 2026-04-15 13:19     ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-15 13:19 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 4:52 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [High] Inconsistent ownership semantics of `allocated_skb` on error paths lead to a memory leak in the zero-copy path.
> --
>
> commit a6016c1c59927d7e151f38b7f237ad699db86b1e
> Author: Jason Xing <kernelxing@tencent.com>
>
> xsk: extend xsk_build_skb() to support passing an already allocated skb
>
> This patch extends xsk_build_skb() to accept an allocated_skb parameter,
> allowing the batch feature to reuse an already allocated skb instead of
> allocating a new one.
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 1194268ae8676..becc8ebfd82f8 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
>
> [ ... ]
>
> > @@ -793,15 +797,16 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> >       return skb;
> >  }
> >
> > -static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > -                                  struct xdp_desc *desc)
> > +struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > +                           struct sk_buff *allocated_skb,
> > +                           struct xdp_desc *desc)
> >  {
> >       struct net_device *dev = xs->dev;
> >       struct sk_buff *skb = xs->skb;
> >       int err;
> >
> >       if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> > -             skb = xsk_build_skb_zerocopy(xs, desc);
> > +             skb = xsk_build_skb_zerocopy(xs, allocated_skb, desc);
> >               if (IS_ERR(skb)) {
> >                       err = PTR_ERR(skb);
> >                       skb = NULL;
> >                       goto free_err;
>
> Does this leak allocated_skb on the zero-copy error path?
>
> If xsk_build_skb_zerocopy() fails on the first descriptor (e.g., if
> xsk_skb_metadata() returns an error due to invalid metadata), it returns an
> ERR_PTR without freeing the SKB.
>
> Because skb is then explicitly set to NULL before jumping to free_err,
> kfree_skb() is skipped.
>
> While the original code also had a leak here for internally allocated SKBs,

Awesome! Sashiko spotted a real bug that predates this series. I will
cook up a standalone patch to fix that first.

Also, I checked the MAX_SKB_FRAGS case in xsk_build_skb_zerocopy(),
which has exactly the same issue.

While reviewing how the skb is destroyed on the error paths, I
realized that there is a problem between patch 2 and patch 13. Here,
at a minimum, I need to add the following code to make sure no
incorrect accounting happens:
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 3bf81b838075..19bc786decbc 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -904,14 +904,16 @@ struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
        if (skb && !skb_shinfo(skb)->nr_frags)
                kfree_skb(skb);

-       if (err == -EOVERFLOW) {
-               /* Drop the packet */
-               xsk_inc_num_desc(xs->skb);
-               xsk_drop_skb(xs->skb);
-               xskq_cons_release(xs->tx);
-       } else {
-               /* Let application retry */
-               xsk_cq_cancel_locked(xs->pool, 1);
+       if (!allocated_skb) {
+               if (err == -EOVERFLOW) {
+                       /* Drop the packet */
+                       xsk_inc_num_desc(xs->skb);
+                       xsk_drop_skb(xs->skb);
+                       xskq_cons_release(xs->tx);
+               } else {
+                       /* Let application retry */
+                       xsk_cq_cancel_locked(xs->pool, 1);
+               }
        }

        return ERR_PTR(err);

The above hunk will then be removed in patch 13. It prevents someone
bisecting between the two patches from triggering the issue.

Thanks,
Jason

> this patch extends the issue to caller-provided SKBs. In copy mode,
> xsk_build_skb() correctly frees allocated_skb on error, but the zero-copy
> path does not.
>
> Could this lead to memory exhaustion if a caller assumes xsk_build_skb()
> always consumes or frees the provided SKB on error?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=2

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 03/14] xsk: add xsk_alloc_batch_skb() to build skbs in batch
  2026-04-15  9:17   ` sashiko-bot
@ 2026-04-16  1:18     ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-16  1:18 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 5:17 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 2 potential issue(s):
> - [High] Pre-allocated SKB and socket memory charge (`sk_wmem_alloc`) are permanently leaked when zerocopy metadata processing fails.
> - [High] Double-free and Use-After-Free of `xs->skb` when fragment allocation fails with `-EAGAIN` in copy mode.
> --
>
> commit 3ef2c03b6204759327ab337fe34617d9dd8e04ef
> Author: Jason Xing <kernelxing@tencent.com>
>
> xsk: add xsk_alloc_batch_skb() to build skbs in batch
>
> This commit introduces xsk_alloc_batch_skb() to support allocating and
> building skbs in batch. It reserves skbs, batches the sk_wmem_alloc charge,
> and then copies data to complete initialization.
>
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
>
> [ ... ]
>
> > +     /* Phase 3: Build SKBs with packet data */
> > +     for (j = 0; j < alloc_descs; j++) {
> > +             if (!xs->skb) {
> > +                     skb = skbs[skb_count - 1 - k];
> > +                     k++;
> > +             }
> > +
> > +             skb = xsk_build_skb(xs, skb, &descs[j]);
> > +             if (IS_ERR(skb)) {
> > +                     *err = PTR_ERR(skb);
> > +                     break;
> > +             }
>
> Can this lead to a double-free and use-after-free of xs->skb if a fragment
> allocation fails?
>
> If xsk_build_skb() fails to allocate a page for a subsequent fragment in copy
> mode, it frees the partially built skb and returns -EAGAIN. However, it
> leaves xs->skb pointing to this freed memory.
>
> When this loop breaks and returns -EAGAIN up the call chain, the caller
> __xsk_generic_xmit_batch() checks xs->skb and calls xsk_drop_skb(xs->skb),
> which would double-free the skb. Furthermore, any subsequent transmissions
> could dereference the dangling pointer. Should xs->skb be explicitly cleared
> when handling this error?

Ah, great job! Sashiko again found a real pre-existing bug! I will
include it in my separate fixes.

The corresponding fix would be adding a stricter check in
xsk_build_skb() to exclude this case and let the caller free it:
free_err:
        if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
                kfree_skb(skb);

>
> > +
> > +             if (xp_mb_desc(&descs[j])) {
> > +                     xs->skb = skb;
> > +                     continue;
> > +             }
> > +
> > +             xs->skb = NULL;
> > +             __skb_queue_tail(&batch->send_queue, skb);
> > +     }
> > +
> > +     /* Phase 4: Reclaim unused allocated SKBs */
> > +     while (k < i)
> > +             kfree_skb(skbs[skb_count - 1 - k++]);
>
> Does this permanently leak the pre-allocated skb and its socket memory charge
> if xsk_build_skb() fails?
>
> When xsk_build_skb() encounters an error like a zerocopy metadata processing
> failure, it returns an ERR_PTR() without freeing the skb. Because k was
> already incremented before the xsk_build_skb() call, the Phase 4 loop starts
> reclaiming at the next skb.
>
> The failed skb at index k - 1 is bypassed. Since it is never freed,
> sock_wfree is skipped, leaking both the skb and the batched sk_wmem_alloc
> charge.

This is the pre-existing bug that you pointed out in patch 2. I'll
take care of it for sure.

Thanks,
Jason

>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 04/14] xsk: cache data buffers to avoid frequently calling kmalloc_reserve
  2026-04-15  9:38   ` sashiko-bot
@ 2026-04-16  2:45     ` Jason Xing
  2026-04-16 12:18       ` Jason Xing
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-16  2:45 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 5:38 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 2 potential issue(s):
> - [High] Cached data buffers are consumed without validating current allocation flags, potentially violating memory placement constraints.
> - [Medium] Meaningful code and commit message mismatch regarding cache refill logic.
> --
>
> commit 5a442e9c44f0a64a147908843084cffe5e2fed9f
> Author: Jason Xing <kernelxing@tencent.com>
>
> xsk: cache data buffers to avoid frequently calling kmalloc_reserve
>
> This commit introduces a per-socket data buffer cache for small packets
> to reduce kmalloc_reserve() overhead. It adds bulk allocation from
> skb_small_head_cache and persists unused buffers across batch cycles.
>
> >     Inside the Phase-1 per-descriptor loop, when a small packet needs a
> >     data buffer and the cache is empty, a single kmem_cache_alloc_bulk()
> >     refills it with generic_xmit_batch objects. Subsequent small packets
> >     pop directly from the cache. Large packets bypass the cache entirely
> >     and fall back to kmalloc_reserve(). Unused buffers remain in the
> >     cache for the next batch.
>
> Is this description accurate? Looking at the implementation below, the bulk
> allocation happens outside the Phase-1 loop. If the cache runs empty during
> the loop, it appears to fall back to kmalloc_reserve() for each subsequent
> packet rather than refilling the cache via kmem_cache_alloc_bulk(). Could
> the commit message be updated to match the actual execution flow?

Will update it.

>
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index c7e42c2966cdd..3b3969adfd23f 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> [ ... ]
> > @@ -709,10 +718,17 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
> >
> >                       skb = skbs[skb_count - 1 - i];
> >                       skbuff_clear(skb);
> > -                     data = kmalloc_reserve(&size, gfp_mask, node, skb);
> > -                     if (unlikely(!data)) {
> > -                             *err = -ENOBUFS;
> > -                             break;
> > +                     if (dc_count &&
> > +                         SKB_HEAD_ALIGN(size) <= SKB_SMALL_HEAD_CACHE_SIZE) {
>
> If the socket's sk_allocation (which populates gfp_mask) was modified to
> require restricted memory placement (e.g. __GFP_DMA), could this code use a
> standard memory buffer left over from a previous batch?
>
> While the bulk allocation check before the loop correctly avoids allocating
> from skb_small_head_cache if KMALLOC_NOT_NORMAL_BITS is set, the
> consumption path here only checks dc_count. Does this need to validate
> gfp_mask against KMALLOC_NOT_NORMAL_BITS before consuming from the cache,
> to ensure memory placement constraints are not bypassed?

I can add the check, but it seems unnecessary to me: nowhere in the
whole xsk implementation can sk_allocation be set to anything other
than GFP_KERNEL.

Thanks,
Jason

>
> > +                             data = dc[--dc_count];
> > +                             size = SKB_SMALL_HEAD_CACHE_SIZE;
> > +                     } else {
> > +                             data = kmalloc_reserve(&size, gfp_mask,
> > +                                                    node, skb);
> > +                             if (unlikely(!data)) {
> > +                                     *err = -ENOBUFS;
> > +                                     break;
> > +                             }
> >                       }
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=4

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 05/14] xsk: add direct xmit in batch function
  2026-04-15  9:11   ` sashiko-bot
@ 2026-04-16  3:04     ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-16  3:04 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 5:12 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 4 potential issue(s):
> - [High] Packet validation executed inside hardware transmit lock, causing severe lock contention
> - [High] Missing transmit recursion protection introduces deadlock vectors
> - [High] Missing device state validation allows transmission to stopped or suspended devices
> - [Medium] Hardware doorbell batching is defeated by hardcoded 'more' parameter
> --
>
> commit b98984346f64d24e3ecd521b219464ccbe9fba91
> Author: Jason Xing <kernelxing@tencent.com>
>
> xsk: add direct xmit in batch function
>
> This commit adds a batch transmission logic for AF_XDP. It introduces a
> new direct transmit function that disables bottom halves, grabs the lock
> once, and sends all aggregated packets in a loop.
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
>
> [ ... ]
>
> > +int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
> > +{
> > +     u16 queue_id = xs->queue_id;
> > +     struct netdev_queue *txq = netdev_get_tx_queue(dev, queue_id);
> > +     struct sk_buff_head *send_queue = &xs->batch.send_queue;
> > +     int ret = NETDEV_TX_BUSY;
> > +     struct sk_buff *skb;
> > +
>
> Should there be a check for the device state before attempting to transmit?
> Direct transmit paths that bypass the Qdisc layer, like __dev_direct_xmit(),
> typically verify that the device is running and the carrier is up:
>
>         if (unlikely(!netif_running(dev) ||
>                      !netif_carrier_ok(dev)))
>                 goto drop;

Actually I did check this at the very beginning of the sendmsg
function, but to stay consistent with the original code I will move it
here.

>
> Could invoking transmission on a stopped or suspended device cause issues?
>
> > +     local_bh_disable();
>
> Are we missing transmit recursion protection here? Standard paths wrap the
> hardware transmit lock with recursion tracking to protect against stack
> overflows and deadlocks when routing through virtual devices:
>
>         dev_xmit_recursion_inc();
>         HARD_TX_LOCK(dev, txq, smp_processor_id());

This is intentional: I don't think it's necessary for XSK, which is
triggered only by sendmsg() in copy mode. Nothing that bad is likely
to happen since nobody binds an xsk to macvlan-type devices.

I will reconsider it, though, as I'm wary that over-engineering here
could hurt performance.

>
> > +     HARD_TX_LOCK(dev, txq, smp_processor_id());
> > +     while ((skb = __skb_dequeue(send_queue)) != NULL) {
> > +             struct sk_buff *orig_skb = skb;
> > +             bool again = false;
> > +
> > +             skb = validate_xmit_skb_list(skb, dev, &again);
>
> Will executing validate_xmit_skb_list() inside the hardware transmit lock
> cause lock contention? This function can perform computationally expensive
> operations such as GSO segmentation, software checksum fallbacks, and
> subsystem hooks. Other transmit paths execute this validation strictly
> before acquiring the hardware lock.

The next patch adds a fast-path check for this, so it has already been
improved.

>
> > +             if (skb != orig_skb) {
> > +                     dev_core_stats_tx_dropped_inc(dev);
> > +                     kfree_skb_list(skb);
> > +                     ret = NET_XMIT_DROP;
> > +                     break;
> > +             }
> > +
> > +             if (netif_xmit_frozen_or_drv_stopped(txq)) {
> > +                     __skb_queue_head(send_queue, skb);
> > +                     break;
> > +             }
> > +             skb_set_queue_mapping(skb, queue_id);
> > +             ret = netdev_start_xmit(skb, dev, txq, false);
>
> Does passing false for the more parameter defeat the purpose of hardware
> doorbell batching? It instructs the driver to ring the hardware doorbell for
> every single packet instead of delaying it until the end of the batch.
>
> It looks like this might be fixed later in the patch series by commit
> 6d7c8f2ad056e ("xsk: support dynamic xmit.more control for batch xmit").

Yep, it was suggested by Jesper already.
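To make the peek-ahead batching idea concrete, here is a userspace
sketch (struct demo_queue and demo_drain() are hypothetical stand-ins,
not the kernel API): `more` stays true for every packet except the
last, so a driver honoring xmit.more rings the doorbell once per batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the skb send queue: an index range over packets. */
struct demo_queue {
	size_t head, tail; /* packets in [head, tail) are still queued */
};

static bool demo_queue_empty(const struct demo_queue *q)
{
	return q->head == q->tail;
}

/* Drain the queue the way the patch does: pass more = true for every
 * packet except the last, so a driver that honors xmit.more only rings
 * the doorbell once per batch. Returns how many doorbells were rung. */
static int demo_drain(struct demo_queue *q)
{
	int doorbells = 0;

	while (!demo_queue_empty(q)) {
		q->head++;                        /* "transmit" one packet      */
		bool more = !demo_queue_empty(q); /* peek: anything left?       */
		if (!more)
			doorbells++;              /* flush only on the last one */
	}
	return doorbells;
}
```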

Thanks,
Jason

>
> > +             if (ret != NETDEV_TX_OK) {
> > +                     if (ret == NETDEV_TX_BUSY)
> > +                             __skb_queue_head(send_queue, skb);
> > +                     break;
> > +             }
> > +     }
> > +     HARD_TX_UNLOCK(dev, txq);
>
> Also, we would need to pair any added recursion increment with a decrement
> here:
>
>         dev_xmit_recursion_dec();
>
> > +     local_bh_enable();
> > +
> > +     return ret;
> > +}
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=5

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 06/14] xsk: support dynamic xmit.more control for batch xmit
  2026-04-15  9:35   ` sashiko-bot
@ 2026-04-16  3:43     ` Jason Xing
  2026-04-16  4:50       ` Dmitry Torokhov
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Xing @ 2026-04-16  3:43 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 5:35 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [High] Stalled TX queue due to missing hardware doorbell ring on early loop exit
> --
>
> commit 6d7c8f2ad056eb4cb2eec8d360e632f9ef9044f5
> Author: Jason Xing <kernelxing@tencent.com>
>
> xsk: support dynamic xmit.more control for batch xmit
>
> This commit updates xsk batch transmission to only set xmit.more to false for
> the last skb in the send queue, aiming to reduce unnecessary IRQs and improve
> batching performance.
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4873,6 +4873,7 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
> >       struct sk_buff_head *send_queue = &xs->batch.send_queue;
> >       int ret = NETDEV_TX_BUSY;
> >       struct sk_buff *skb;
> > +     bool more = true;
> >
> >       local_bh_disable();
> >       HARD_TX_LOCK(dev, txq, smp_processor_id());
>
> [ ... ]
>
> > @@ -4892,8 +4893,12 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
> >                       __skb_queue_head(send_queue, skb);
> >                       break;
> >               }
> > +
> > +             if (!skb_peek(send_queue))
> > +                     more = false;
> > +
> >               skb_set_queue_mapping(skb, queue_id);
> > -             ret = netdev_start_xmit(skb, dev, txq, false);
> > +             ret = netdev_start_xmit(skb, dev, txq, more);
> >               if (ret != NETDEV_TX_OK) {
> >                       if (ret == NETDEV_TX_BUSY)
> >                               __skb_queue_head(send_queue, skb);
> >                       break;
> >               }
> >       }
>
> If an error occurs and the loop breaks early, does this leave previously
> transmitted packets stalled in the driver's TX ring?
>
> Since more is passed as true for all packets prior to the failure, the
> network driver defers ringing the hardware TX doorbell.
>
> If the code breaks out of the loop due to validate_xmit_skb_list() failing,
> netif_xmit_frozen_or_drv_stopped(), or netdev_start_xmit() returning
> NETDEV_TX_BUSY, the loop terminates without ever passing more = false to the
> driver.
>
> Could this omission prevent the hardware from being notified to transmit
> the successfully processed packets that preceded the error?

A quite similar code snippet was added to dev_hard_start_xmit() by the
following commit in 2014:
commit 95f6b3dda2a4a052f7dabe9998e4ffac491b7bc2
Author: David S. Miller <davem@davemloft.net>
Date:   Fri Aug 29 21:57:30 2014 -0700

    net: Have xmit_list() signal more==true when appropriate.

    Signed-off-by: David S. Miller <davem@davemloft.net>

Look at how drivers ring the doorbell, e.g. ixgbe:
if (netif_xmit_stopped(txring_txq(tx_ring)) || !netdev_xmit_more()) {
        writel(i, tx_ring->tail);
}

It has already taken care of it.
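As a tiny sketch of that ixgbe-style condition (the function name here
is hypothetical), the driver-side decision boils down to:

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the quoted ixgbe check: the tail register (doorbell) is
 * written when the stack signals no more packets are coming
 * (xmit_more == false) or when the queue is being stopped, so
 * descriptors queued before an early loop exit still get flushed the
 * next time either condition holds. */
static bool demo_should_ring_doorbell(bool queue_stopped, bool xmit_more)
{
	return queue_stopped || !xmit_more;
}
```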

If we really want to cover all possible drivers in the above two
cases, we may need a separate patchset adding a common device
interface to actively kick the hardware. For now, I choose to defer
that unless someone insists on it.

Thanks,
Jason

>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=6

* Re: [PATCH RFC net-next v4 06/14] xsk: support dynamic xmit.more control for batch xmit
  2026-04-16  3:43     ` Jason Xing
@ 2026-04-16  4:50       ` Dmitry Torokhov
  2026-04-16  4:51         ` Dmitry Torokhov
  0 siblings, 1 reply; 40+ messages in thread
From: Dmitry Torokhov @ 2026-04-16  4:50 UTC (permalink / raw)
  To: Jason Xing; +Cc: sashiko, bpf

On Thu, Apr 16, 2026 at 11:43:53AM +0800, Jason Xing wrote:
> On Wed, Apr 15, 2026 at 5:35 PM <sashiko-bot@kernel.org> wrote:
> >
> > Sashiko AI review found 1 potential issue(s):
> > - [High] Stalled TX queue due to missing hardware doorbell ring on early loop exit

Does sashiko bot have reply-to set to "sashiko@lists.linux.dev" by
chance? And can it be changed? I am subscribed here for discussions and
news about sashiko itself, not for replies from all different kernel
subsystems to reviews of their patches...

Thanks.

-- 
Dmitry

* Re: [PATCH RFC net-next v4 06/14] xsk: support dynamic xmit.more control for batch xmit
  2026-04-16  4:50       ` Dmitry Torokhov
@ 2026-04-16  4:51         ` Dmitry Torokhov
  0 siblings, 0 replies; 40+ messages in thread
From: Dmitry Torokhov @ 2026-04-16  4:51 UTC (permalink / raw)
  To: Jason Xing, roman.gushchin; +Cc: sashiko, bpf

+Roman

On Wed, Apr 15, 2026 at 09:50:10PM -0700, Dmitry Torokhov wrote:
> On Thu, Apr 16, 2026 at 11:43:53AM +0800, Jason Xing wrote:
> > On Wed, Apr 15, 2026 at 5:35 PM <sashiko-bot@kernel.org> wrote:
> > >
> > > Sashiko AI review found 1 potential issue(s):
> > > - [High] Stalled TX queue due to missing hardware doorbell ring on early loop exit
> 
> Does sashiko bot have reply-to set to "sashiko@lists.linux.dev" by
> chance? And can it be changed? I am subscribed here for discussions and
> news about sashiko itself, not for replies from all different kernel
> subsystems to reviews of their patches...
> 
> Thanks.
> 

-- 
Dmitry

* Re: [PATCH RFC net-next v4 07/14] xsk: try to skip validating skb list in xmit path
  2026-04-15  9:33   ` sashiko-bot
@ 2026-04-16  5:55     ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-16  5:55 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 5:33 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 2 potential issue(s):
> - [Critical] Bypassing `validate_xmit_skb_list()` based on `NETIF_F_SG` skips critical hardware capability checks, software fallbacks, and SKB initialization, leading to kernel panics and corrupted packets.
> - [High] Breaking out of the transmit loop early leaves packets stuck in the TX ring indefinitely because the hardware doorbell is never flushed.
> --
>
> commit 7603ac608a36b8d39e026a60b567ea2b106617f0
> Author: Jason Xing <kernelxing@tencent.com>
>
> xsk: try to skip validating skb list in xmit path
>
> This patch attempts to optimize the XSK transmit path by skipping
> validate_xmit_skb_list() if the device supports Scatter-Gather,
> aiming to improve performance for drivers like ixgbe.
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 99188353abec6..12643cbab6530 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> [ ... ]
> > @@ -4878,15 +4879,17 @@ int xsk_direct_xmit_batch(struct xdp_sock *xs, struct net_device *dev)
> >       local_bh_disable();
> >       HARD_TX_LOCK(dev, txq, smp_processor_id());
> >       while ((skb = __skb_dequeue(send_queue)) != NULL) {
> > -             struct sk_buff *orig_skb = skb;
> > -             bool again = false;
> > -
> > -             skb = validate_xmit_skb_list(skb, dev, &again);
> > -             if (skb != orig_skb) {
> > -                     dev_core_stats_tx_dropped_inc(dev);
> > -                     kfree_skb_list(skb);
> > -                     ret = NET_XMIT_DROP;
> > -                     break;
> > +             if (unlikely(need_validate)) {
> > +                     struct sk_buff *orig_skb = skb;
> > +                     bool again = false;
> > +
> > +                     skb = validate_xmit_skb_list(skb, dev, &again);
>
> Does skipping validate_xmit_skb_list() based on NETIF_F_SG bypass necessary
> hardware capability checks and software fallbacks?
>
> validate_xmit_skb() performs dynamic feature verification via
> ndo_features_check(), transport header initialization via
> skb_set_transport_header(), and software checksum fallbacks via
> skb_csum_hwoffload_help().
>
> AF_XDP users can request TX checksum offload, setting skb->ip_summed to
> CHECKSUM_PARTIAL. By skipping validation solely because the device supports
> Scatter-Gather, packets bypass skb_set_transport_header(), leaving
> skb->transport_header uninitialized.
>
> Could this cause drivers that use skb_transport_offset() to compute massive
> sizes, corrupting TX descriptors and potentially hanging the NIC?
>
> Additionally, if the device lacks NETIF_F_HW_CSUM, skipping
> skb_csum_hwoffload_help() sends CHECKSUM_PARTIAL to drivers unable to handle
> it. Might this result in silent data corruption on the wire or driver crashes?

Makes sense. I will exclude this metadata case (checksum offload in
use) by adding a new condition:
if (unlikely(need_validate || skb->ip_summed == CHECKSUM_PARTIAL)) {
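To spell out the amended condition, here is a userspace sketch
(demo_need_validate() is a hypothetical helper; the real check lives in
xsk_direct_xmit_batch()) of when full validation must still run:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the amended fast-path check: validate_xmit_skb_list() may
 * be skipped only when the device supports scatter-gather AND the skb
 * does not carry a pending checksum offload (CHECKSUM_PARTIAL), since
 * validation is what initializes the transport header and provides the
 * software checksum fallback. */
static bool demo_need_validate(bool dev_has_sg, bool csum_partial)
{
	return !dev_has_sg || csum_partial;
}
```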

>
> > +                     if (skb != orig_skb) {
> > +                             dev_core_stats_tx_dropped_inc(dev);
> > +                             kfree_skb_list(skb);
> > +                             ret = NET_XMIT_DROP;
> > +                             break;
>
> If this loop breaks early, does it leave packets stuck in the TX ring?
>
> While this behavior was introduced earlier in the series and not by this
> specific patch, the issue is present in this modified block.
>
> The batching logic calls netdev_start_xmit() with more = true for all packets
> except the last one in the queue.
>
> If the loop breaks early here due to a validate_xmit_skb_list() failure,
> the function returns without ever calling netdev_start_xmit() with
> more = false.
>
> Will this cause the driver to defer ringing the hardware TX doorbell,
> leaving previously submitted packets stranded in the TX ring and causing
> latency spikes or TX watchdog timeouts?

I answered this in patch 6. No need to take action in this series
unless the maintainers request it first.

Thanks,
Jason

>
> > +                     }
> >               }
> >
> >               if (netif_xmit_frozen_or_drv_stopped(txq)) {
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=7

* Re: [PATCH RFC net-next v4 11/14] xsk: support batch xmit main logic
  2026-04-15  9:38   ` sashiko-bot
@ 2026-04-16  9:58     ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-16  9:58 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 5:38 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 5 potential issue(s):
> - [Critical] TOCTOU race on socket state pointers leading to NULL pointer dereference and Use-After-Free.
> - [Critical] Integer underflow in `xsk_alloc_batch_skb()` leading to massive OOB memory allocation and kernel crash.
> - [Critical] Unconditional dropping of partially built packets (`xs->skb`) causes buffer leaks and packet stream corruption.
> - [High] Double cancellation of CQ slots and incorrect TX pointer manipulation on SKB build failure.
> - [High] Infinite `-EAGAIN` livelock and stat inflation when packets span batch boundaries.
> --
>
> commit 4041fe3cb310f6f01b04ae2ff1fbd04cf0aa2713
> Author: Jason Xing <kernelxing@tencent.com>
> xsk: support batch xmit main logic
>
> This commit implements the core logic for batched transmission in AF_XDP
> sockets via __xsk_generic_xmit_batch(). It introduces a loop to read
> descriptors, reserve completion queue slots, allocate SKBs, and transmit
> packets in batches.
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index b4f3cc7125804..16e47f6fc30ec 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> [ ... ]
> > @@ -914,6 +914,122 @@ struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >       return ERR_PTR(err);
> >  }
> >
> > +static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
> > +{
> > +     struct xsk_buff_pool *pool = xs->pool;
> > +     struct xsk_batch *batch = &xs->batch;
> > +     struct xdp_desc *descs = batch->desc_cache;
> > +     struct net_device *dev = xs->dev;
> > +     u32 max_batch, max_budget;
> > +     bool sent_frame = false;
> > +     struct sk_buff *skb;
> > +     u32 cons_descs;
> > +     int err = 0;
> > +     u32 i = 0;
> > +
> > +     mutex_lock(&xs->mutex);
>
> Can a TOCTOU race occur here with a concurrent socket bind() or setsockopt()?
>
> The pointers for pool, dev, and batch->desc_cache are read before the mutex
> is acquired. If another thread modifies the socket state, could these
> pointers be NULL or freed by the time xsk_is_bound() is checked, leading to
> a use-after-free or NULL pointer dereference when accessing dev or descs?

Indeed, I'm going to move those read operations under the mutex lock protection.

>
> > +
> > +     /* Since we dropped the RCU read lock, the socket state might have changed. */
> > +     if (unlikely(!xsk_is_bound(xs))) {
> [ ... ]
> > +     for (i = 0; i < max_budget; i += cons_descs) {
> > +             u32 nb_pkts = 0;
> > +             u32 nb_descs;
> [ ... ]
> > +             cons_descs = xskq_cons_read_desc_batch_copy(xs->tx, pool, descs,
> > +                                                         nb_descs, &nb_pkts);
> > +             if (cons_descs < nb_descs) {
> > +                     u32 delta = nb_descs - cons_descs;
> > +
> > +                     xsk_cq_cancel_locked(pool, delta);
> > +                     xs->tx->queue_empty_descs += delta;
> > +                     if (!cons_descs) {
> > +                             err = -EAGAIN;
> > +                             goto out;
> > +                     }
>
> Could this lead to an infinite -EAGAIN livelock and inflated statistics?
>
> If a multi-fragment packet spans beyond the nb_descs batch limit without an
> end-of-packet descriptor, xskq_cons_read_desc_batch_copy() might safely
> rewind the fragments and return 0. The code then treats these valid
> descriptors as empty, adds to queue_empty_descs, and returns -EAGAIN,
> causing the application to loop infinitely hitting the same boundary.

Point taken. It took me some time to investigate how to support
multi-buffer in this scenario. I finally came to the conclusion that
changing the behavior of __xskq_cons_read_desc_batch() should be fine:
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index c3b97c6f2910..c8b7079a6728 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -270,6 +270,7 @@ __xskq_cons_read_desc_batch(struct xsk_queue *q,
struct xsk_buff_pool *pool,
 {
        u32 cached_cons = q->cached_cons, nb_entries = 0;
        u32 total_descs = 0, nr_frags = 0;
+       bool copy = !!nb_pkts;

        /* track first entry, if stumble upon *any* invalid descriptor, rewind
         * current packet that consists of frags and stop the processing
@@ -288,7 +289,7 @@ __xskq_cons_read_desc_batch(struct xsk_queue *q,
struct xsk_buff_pool *pool,
                if (likely(!parsed.mb)) {
                        total_descs += (nr_frags + 1);
                        nr_frags = 0;
-                       if (nb_pkts)
+                       if (copy)
                                (*nb_pkts)++;
                } else {
                        nr_frags++;
@@ -300,10 +301,11 @@ __xskq_cons_read_desc_batch(struct xsk_queue *q,
struct xsk_buff_pool *pool,
                nb_entries++;
        }

-       cached_cons -= nr_frags;
+       if (!copy)
+               cached_cons -= nr_frags;
        /* Release valid plus any invalid entries */
        xskq_cons_release_n(q, cached_cons - q->cached_cons);
-       return total_descs;
+       return copy ? nb_entries : total_descs;
 }

 static inline u32

After this change, the function returns the correct number of descriptors.

>
> > +                     nb_descs = cons_descs;
> > +             }
> > +
> > +             cons_descs = xsk_alloc_batch_skb(xs, nb_pkts, nb_descs, &err);
>
> Does this calculation of nb_pkts trigger an integer underflow?
>
> If a batch contains only fragments without an end-of-packet flag, nb_pkts
> will be 0. Inside xsk_alloc_batch_skb(), if there is an existing partially
> built skb (xs->skb != NULL), the code may decrement nb_pkts. Since nb_pkts
> is unsigned, nb_pkts-- would underflow to 0xFFFFFFFF, resulting in a
> massive out-of-bounds memory allocation.

Right, I should avoid that when nb_pkts is zero:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a61477d04c14..e04eda59eb7a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -671,7 +671,7 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32
nb_pkts, u32 nb_descs, int *err
        if (!(dev->priv_flags & IFF_TX_SKB_NO_LINEAR))
                base_len += dev->needed_tailroom;

-       if (xs->skb)
+       if (xs->skb && nb_pkts > 0)
                nb_pkts--;

        if (skb_count >= nb_pkts)
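The underflow the bot describes is easy to reproduce in a userspace
sketch (demo_adjust_nb_pkts() is hypothetical, standing in for the
adjustment in xsk_alloc_batch_skb()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* If a partially built skb already exists, one packet of the request
 * is already accounted for. Without the nb_pkts > 0 guard, a
 * fragment-only batch (nb_pkts == 0) would wrap to 0xFFFFFFFF and
 * request a massive allocation. */
static uint32_t demo_adjust_nb_pkts(uint32_t nb_pkts, bool have_partial_skb)
{
	if (have_partial_skb && nb_pkts > 0)
		nb_pkts--;
	return nb_pkts;
}
```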

>
> > +             /* Return 'nb_descs - cons_descs' number of descs to the
> > +              * pool if the batch allocation partially fails
> > +              */
> > +             if (cons_descs < nb_descs) {
> > +                     xskq_cons_cancel_n(xs->tx, nb_descs - cons_descs);
> > +                     xsk_cq_cancel_locked(pool, nb_descs - cons_descs);
> > +             }
>
> Can this cause a double cancellation of completion queue slots?
>
> If xsk_build_skb() fails inside xsk_alloc_batch_skb() due to memory limits,
> it already calls xsk_cq_cancel_locked() internally:
>
> net/xdp/xsk.c:xsk_build_skb() {
>         ...
>         if (err == -EOVERFLOW) {
>                 ...
>         } else {
>                 xsk_cq_cancel_locked(xs->pool, 1);
>         }
> }
>
> Calling xsk_cq_cancel_locked() again here for the failed descriptors would
> underflow the completion queue pointers, potentially corrupting the user
> space ring buffer.
>
> Also, does xsk_build_skb() advancing the TX consumer pointer on -EOVERFLOW
> conflict with the rewind happening here?

Right. As I mentioned in the previous patch, we can handle this issue
in patch 2 by bypassing this part in batch mode.

>
> [ ... ]
> > +             if (err)
> > +                     goto out;
> > +     }
> > +
> > +     /* Maximum budget of descriptors have been consumed */
> > +     if (xskq_has_descs(xs->tx))
> > +             err = -EAGAIN;
> > +
> > +out:
> > +     if (xs->skb)
> > +             xsk_drop_skb(xs->skb);
>
> Does unconditionally dropping the partially built xs->skb corrupt the packet
> stream?
>
> If transmission hits a NETDEV_TX_BUSY (-EAGAIN) or a valid multi-fragment
> packet spans a batch boundary, the code drops the fragments processed so far.
> Because the completion queue slots and TX pointers for these early fragments
> might not be properly rewound or notified, does this permanently leak buffers
> and merge unrelated fragments into corrupted packets on the next sendmsg()
> retry?

Right. It will be fixed by:
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 8c115202fe42..c7c81ee81c00 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1020,7 +1020,7 @@ static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
                err = -EAGAIN;

 out:
-       if (xs->skb)
+       if (xs->skb && err != -EAGAIN)
                xsk_drop_skb(xs->skb);

Thanks,
Jason

>
> > +
> > +     /* If send_queue has more pending skbs, we must to clear
> > +      * the rest of them.
> > +      */
> > +     while ((skb = __skb_dequeue(&batch->send_queue)) != NULL) {
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=11

* Re: [PATCH RFC net-next v4 12/14] xsk: separate read-mostly and write-heavy fields in xsk_buff_pool
  2026-04-15  9:20   ` sashiko-bot
@ 2026-04-16 10:09     ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-16 10:09 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 5:20 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [High] The patch groups NAPI RX write-heavy fields and NAPI TX write-heavy fields into the exact same cacheline, introducing severe false sharing between RX and TX CPUs.
> --
>
> commit cf39ff4311089098b2e57c37ba6e85e7a36ac4a8
> Author: Jason Xing <kernelxing@tencent.com>
>
> xsk: separate read-mostly and write-heavy fields in xsk_buff_pool
>
> This commit separates read-mostly pool geometry fields from write-heavy
> fields in struct xsk_buff_pool by adding cacheline alignment. This is
> intended to eliminate false sharing between the sendmsg CPU and the NAPI
> TX-completion CPU.
>
> > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > index 0b1abdb99c9ea..0376bd4cc95b0 100644
> > --- a/include/net/xsk_buff_pool.h
> > +++ b/include/net/xsk_buff_pool.h
>
> [ ... ]
>
> > @@ -73,23 +73,27 @@ struct xsk_buff_pool {
>
> [ ... ]
>
> >       void *addrs;
> > +
> > +     /* Write-heavy fields */
> >       /* Mutual exclusion of the completion ring in the SKB mode.
> >        * Protect: NAPI TX thread and sendmsg error paths in the SKB
> >        * destructor callback.
> >        */
> > -     spinlock_t cq_prod_lock;
> > +     spinlock_t cq_prod_lock ____cacheline_aligned_in_smp;
> > +     u8 cached_need_wakeup;
> > +     u32 free_heads_cnt;
> >       struct xdp_buff_xsk *free_heads[];
> >  };
>
> Could placing these four fields together in the same aligned cacheline
> introduce new false sharing between the NAPI RX and TX CPUs?
>
> The fields cq_prod_lock and cached_need_wakeup are updated by the TX
> completion path.
>
> However, free_heads_cnt and the free_heads flexible array are heavily
> mutated by the RX path during buffer allocation and release.
>
> Because cq_prod_lock (spinlock_t) and cached_need_wakeup (u8) only consume
> a fraction of the 64-byte cacheline, free_heads_cnt and the first several
> elements of the free_heads array will pack into this exact same cacheline.
>
> If a deployment pins RX and TX interrupts to different CPU cores, it looks
> like this might cause continuous cacheline bouncing between the RX CPU and
> the TX CPU for every packet processed.

This isn't introduced by this patch; those counters were already in
the same cacheline before.

But the suggestion is good, and it inspired me to separate the tx and
rx counters at the expense of making the whole structure slightly
larger.

Now the problem is gone:
spinlock_t cq_prod_lock ____cacheline_aligned_in_smp;
u8 cached_need_wakeup;
u32 free_heads_cnt ____cacheline_aligned_in_smp;
struct xdp_buff_xsk *free_heads[];
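The effect of that layout can be checked in a userspace sketch (struct
demo_pool is hypothetical with simplified types; alignas(64) plays the
role of ____cacheline_aligned_in_smp):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHELINE 64

/* Userspace sketch of the fixed layout: the RX-side free-list counter
 * is forced onto its own 64-byte cacheline so it no longer shares one
 * with the TX-side lock and wakeup flag. */
struct demo_pool {
	/* read-mostly geometry */
	uint32_t chunk_size;
	uint32_t headroom;
	/* write-heavy, TX-completion side */
	alignas(DEMO_CACHELINE) int cq_prod_lock;
	uint8_t cached_need_wakeup;
	/* write-heavy, RX side: starts the next cacheline */
	alignas(DEMO_CACHELINE) uint32_t free_heads_cnt;
};

static int demo_same_cacheline(size_t a, size_t b)
{
	return a / DEMO_CACHELINE == b / DEMO_CACHELINE;
}
```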

Thanks,
Jason

>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=12

* Re: [PATCH RFC net-next v4 13/14] xsk: retire old xmit path in copy mode
  2026-04-15  9:18   ` sashiko-bot
@ 2026-04-16 10:33     ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-16 10:33 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 5:18 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 3 potential issue(s):
> - [Critical] Time-of-Check to Time-of-Use (TOCTOU) Use-After-Free (UAF) vulnerability on `batch->desc_cache`.
> - [High] Leaked socket structure on error path in `xsk_create()`.
> - [High] Unbounded `kmalloc` size for user-controlled batch size triggers `WARN_ON` and potential panic.
> --
>
> commit 5f2cebbff2efe715b882067f337741749751ac65
> Author: Jason Xing <kernelxing@tencent.com>
>
> xsk: retire old xmit path in copy mode
>
> This commit adds a new helper xsk_init_batch() with a default value of 1,
> and obsoletes __xsk_generic_xmit() by substituting it with
> __xsk_generic_xmit_batch() in the generic transmit path.
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 16e47f6fc30ec..232e7bf96b342 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> [ ... ]
> > @@ -1030,101 +1030,14 @@ static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
> >       return err;
> >  }
> >
> [ ... ]
> >  static int xsk_generic_xmit(struct sock *sk)
> >  {
> > +     struct xdp_sock *xs = xdp_sk(sk);
> >       int ret;
> >
> >       /* Drop the RCU lock since the SKB path might sleep. */
> >       rcu_read_unlock();
> > -     ret = __xsk_generic_xmit(sk);
> > +     ret = __xsk_generic_xmit_batch(xs);
>
> By switching to __xsk_generic_xmit_batch(), does this expose a
> use-after-free race with xsk_setsockopt()?
>
> In __xsk_generic_xmit_batch(), the local pointer descs is assigned
> batch->desc_cache before acquiring xs->mutex.
>
> Concurrently, a user can call setsockopt(..., XDP_GENERIC_XMIT_BATCH),
> which acquires xs->mutex and updates the batch size via xsk_init_batch().
> This safely frees the old desc_cache array using kvfree() in
> xsk_batch_reset() because the setsockopt path does not check for
> xs->state != XSK_READY.
>
> Once setsockopt drops the mutex, the blocked __xsk_generic_xmit_batch()
> thread acquires the mutex and proceeds to use its stale descs pointer.
> Can this lead to a use-after-free write into the freed descriptor array?

Ack. This is the issue we discussed previously. I will move those read
operations under the mutex lock in __xsk_generic_xmit_batch().

>
> >       /* Reaquire RCU lock before going into common code. */
> >       rcu_read_lock();
> [ ... ]
> > @@ -1620,6 +1533,34 @@ struct xdp_umem_reg_v1 {
> >       __u32 headroom;
> >  };
> >
> > +static int xsk_init_batch(struct xsk_batch *batch, unsigned int size)
> > +{
> > +     struct xdp_desc *descs;
> > +     struct sk_buff **skbs;
> > +     void **data;
> > +
> > +     skbs = kmalloc(size * sizeof(struct sk_buff *), GFP_KERNEL);
>
> Can this trigger a WARN_ON_ONCE in the page allocator if a user requests
> a very large batch size?
>
> The size parameter is controlled by userspace via the XDP_GENERIC_XMIT_BATCH
> setsockopt and is bounded by the TX ring size (xs->max_tx_budget), which
> can be quite large.
>
> Since these pointer arrays can exceed KMALLOC_MAX_SIZE, should this be
> allocated using kvmalloc_array() similar to how the descs array is
> allocated below?

This is also what you pointed out. Even though I thought it was
trivial, I decided to use the _array() variant to be safe.
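For reference, a userspace sketch of one reason the _array() helpers
are safer (demo_array_size() is hypothetical; the real
kmalloc_array()/kvmalloc_array() return NULL rather than a 0 size):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The _array() allocators reject multiplications that overflow instead
 * of silently allocating a truncated size. Returns 0 on overflow,
 * otherwise the byte count that would be requested. */
static size_t demo_array_size(size_t n, size_t elem)
{
	if (elem != 0 && n > SIZE_MAX / elem)
		return 0; /* kmalloc_array()/kvmalloc_array() would return NULL */
	return n * elem;
}
```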

>
> > +     if (!skbs)
> > +             return -ENOMEM;
> > +
> > +     data = kmalloc_array(size, sizeof(void *), GFP_KERNEL);
> > +     if (!data) {
> > +             kfree(skbs);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     descs = kvcalloc(size, sizeof(struct xdp_desc), GFP_KERNEL);
> [ ... ]
> > @@ -2065,6 +1984,11 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
> >       if (!sk)
> >               return -ENOBUFS;
> >
> > +     xs = xdp_sk(sk);
> > +     ret = xsk_init_batch(&xs->batch, 1);
> > +     if (ret)
> > +             return ret;
> > +
>
> Does this leak the sk object if xsk_init_batch() fails?
>
> If xsk_init_batch() returns an error here, the function returns early
> without calling sk_free(sk).

My bad. I should have noticed this....

I'm totally getting lost in so many checks and conditions while
refactoring the xmit path...

Thanks,
Jason


>
> Because sock_init_data() has not been executed yet to link the sk to the
> sock structure, the VFS socket layer will not free the socket memory when
> tearing down the failed creation. Should this path clean up the allocated
> socket before returning?
>
> >       sock->ops = &xsk_proto_ops;
> >
> >       sock_init_data(sock, sk);
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=13

* Re: [PATCH RFC net-next v4 04/14] xsk: cache data buffers to avoid frequently calling kmalloc_reserve
  2026-04-16  2:45     ` Jason Xing
@ 2026-04-16 12:18       ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-16 12:18 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Thu, Apr 16, 2026 at 10:45 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, Apr 15, 2026 at 5:38 PM <sashiko-bot@kernel.org> wrote:
> >
> > Sashiko AI review found 2 potential issue(s):
> > - [High] Cached data buffers are consumed without validating current allocation flags, potentially violating memory placement constraints.
> > - [Medium] Meaningful code and commit message mismatch regarding cache refill logic.
> > --
> >
> > commit 5a442e9c44f0a64a147908843084cffe5e2fed9f
> > Author: Jason Xing <kernelxing@tencent.com>
> >
> > xsk: cache data buffers to avoid frequently calling kmalloc_reserve
> >
> > This commit introduces a per-socket data buffer cache for small packets
> > to reduce kmalloc_reserve() overhead. It adds bulk allocation from
> > skb_small_head_cache and persists unused buffers across batch cycles.
> >
> > >     Inside the Phase-1 per-descriptor loop, when a small packet needs a
> > >     data buffer and the cache is empty, a single kmem_cache_alloc_bulk()
> > >     refills it with generic_xmit_batch objects. Subsequent small packets
> > >     pop directly from the cache. Large packets bypass the cache entirely
> > >     and fall back to kmalloc_reserve(). Unused buffers remain in the
> > >     cache for the next batch.
> >
> > Is this description accurate? Looking at the implementation below, the bulk
> > allocation happens outside the Phase-1 loop. If the cache runs empty during
> > the loop, it appears to fall back to kmalloc_reserve() for each subsequent
> > packet rather than refilling the cache via kmem_cache_alloc_bulk(). Could
> > the commit message be updated to match the actual execution flow?
>
> Will update it.
>
> >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index c7e42c2966cdd..3b3969adfd23f 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > [ ... ]
> > > @@ -709,10 +718,17 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
> > >
> > >                       skb = skbs[skb_count - 1 - i];
> > >                       skbuff_clear(skb);
> > > -                     data = kmalloc_reserve(&size, gfp_mask, node, skb);
> > > -                     if (unlikely(!data)) {
> > > -                             *err = -ENOBUFS;
> > > -                             break;
> > > +                     if (dc_count &&
> > > +                         SKB_HEAD_ALIGN(size) <= SKB_SMALL_HEAD_CACHE_SIZE) {
> >
> > If the socket's sk_allocation (which populates gfp_mask) was modified to
> > require restricted memory placement (e.g. __GFP_DMA), could this code use a
> > standard memory buffer left over from a previous batch?
> >
> > While the bulk allocation check before the loop correctly avoids allocating
> > from skb_small_head_cache if KMALLOC_NOT_NORMAL_BITS is set, the
> > consumption path here only checks dc_count. Does this need to validate
> > gfp_mask against KMALLOC_NOT_NORMAL_BITS before consuming from the cache,
> > to ensure memory placement constraints are not bypassed?
>
> I can add the check but it seems meaningless to me. In the whole
> implementation of xsk, there is no place where sk_allocation can be
> set to a value other than GFP_KERNEL.

After a few rounds of investigation, I finally gave up on the idea of
using KMALLOC_NOT_NORMAL_BITS, for the reason explained above.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC net-next v4 14/14] xsk: optimize xsk_build_skb for batch copy-mode fast path
  2026-04-15  9:47   ` sashiko-bot
@ 2026-04-16 13:12     ` Jason Xing
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Xing @ 2026-04-16 13:12 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Wed, Apr 15, 2026 at 5:47 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 2 potential issue(s):
> - [High] An SKB memory leak occurs when xsk_build_skb() returns an error for a newly assigned SKB.
> - [Medium] The prefetch instruction uses the raw AF_XDP descriptor address without extracting the offset, creating an invalid pointer in unaligned mode.
> --
>
> commit 7bdeef673e571c57c404cb8d05446f46b3e2676e
> Author: Jason Xing <kernelxing@tencent.com>
>
> xsk: optimize xsk_build_skb for batch copy-mode fast path
>
> This commit introduces three targeted optimizations for the batch
> copy-mode TX hot path in XDP sockets, including replacing skb_store_bits()
> with memcpy(), inlining UMEM address computation, and prefetching the
> next descriptor's UMEM data.
>
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 3b3969adfd23f..0fccb0c8cbe90 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -751,14 +751,28 @@ int xsk_alloc_batch_skb(struct xdp_sock *xs, u32 nb_pkts, u32 nb_descs, int *err
> >       if (total_truesize)
> >               refcount_add(total_truesize, &xs->sk.sk_wmem_alloc);
> >
> > -     /* Phase 3: Build SKBs with packet data */
> > +     /* Phase 3: Build SKBs with packet data. */
> > +     struct xsk_buff_pool *pool = xs->pool;
> > +     void *pool_addrs = pool->addrs;
> > +     bool unaligned = pool->unaligned;
> > +
> >       for (j = 0; j < alloc_descs; j++) {
> > +             u64 addr = descs[j].addr;
> > +             void *buffer;
> > +
> > +             if (unaligned)
> > +                     addr = xp_unaligned_add_offset_to_addr(addr);
> > +             buffer = pool_addrs + addr;
> > +
> > +             if (j + 1 < alloc_descs)
> > +                     prefetch(pool_addrs + descs[j + 1].addr);
>
> Does this prefetch use the raw AF_XDP descriptor address without extracting
> the offset?
>
> In unaligned mode, an AF_XDP descriptor addr field encodes a chunk offset
> in its upper 16 bits. While xp_unaligned_add_offset_to_addr() is used to
> decode the current descriptor address, the prefetch directly adds
> descs[j + 1].addr to pool_addrs.
>
> Could this result in an out-of-bounds pointer in unaligned mode, failing to
> hide memory latency and causing cache pollution?

Will handle unaligned mode soon.

>
> > +
> >               if (!xs->skb) {
> >                       skb = skbs[skb_count - 1 - k];
> >                       k++;
> >               }
> >
> > -             skb = xsk_build_skb(xs, skb, &descs[j]);
> > +             skb = xsk_build_skb(xs, skb, &descs[j], buffer);
> >               if (IS_ERR(skb)) {
> >                       *err = PTR_ERR(skb);
> >                       break;
>
> Can this leak the newly assigned SKB if xsk_build_skb() returns an error?

This patch doesn't introduce anything new here; the error handling is
unchanged from the previous patches.

>
> If xsk_build_skb() fails, it returns ERR_PTR(err) and the loop breaks. Since
> k was already incremented, the cleanup loop in Phase 4 (while (k < i)
> kfree_skb(...)) reclaims unused SKBs starting from the current k, skipping
> the failed SKB at index k - 1.
>
> Could a malicious userspace program exploit this to systematically leak SKBs
> by triggering metadata errors?

I confirm this claim is wrong: either 1) xsk_build_skb() frees the skb
directly if frag is zero, or 2) __xsk_generic_xmit_batch() takes
responsibility for freeing xs->skb if frag is not zero and err is not
-EAGAIN.

Thanks,
Jason

>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260415082654.21026-1-kerneljasonxing@gmail.com?part=14

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2026-04-16 13:13 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-15  8:26 [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 01/14] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
2026-04-15  8:51   ` sashiko-bot
2026-04-15 12:46     ` Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 02/14] xsk: extend xsk_build_skb() to support passing an already allocated skb Jason Xing
2026-04-15  8:52   ` sashiko-bot
2026-04-15 13:19     ` Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 03/14] xsk: add xsk_alloc_batch_skb() to build skbs in batch Jason Xing
2026-04-15  9:17   ` sashiko-bot
2026-04-16  1:18     ` Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 04/14] xsk: cache data buffers to avoid frequently calling kmalloc_reserve Jason Xing
2026-04-15  9:38   ` sashiko-bot
2026-04-16  2:45     ` Jason Xing
2026-04-16 12:18       ` Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 05/14] xsk: add direct xmit in batch function Jason Xing
2026-04-15  9:11   ` sashiko-bot
2026-04-16  3:04     ` Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 06/14] xsk: support dynamic xmit.more control for batch xmit Jason Xing
2026-04-15  9:35   ` sashiko-bot
2026-04-16  3:43     ` Jason Xing
2026-04-16  4:50       ` Dmitry Torokhov
2026-04-16  4:51         ` Dmitry Torokhov
2026-04-15  8:26 ` [PATCH RFC net-next v4 07/14] xsk: try to skip validating skb list in xmit path Jason Xing
2026-04-15  9:33   ` sashiko-bot
2026-04-16  5:55     ` Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 08/14] xsk: rename nb_pkts to nb_descs in xsk_tx_peek_release_desc_batch Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 09/14] xsk: extend xskq_cons_read_desc_batch to count nb_pkts Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 10/14] xsk: extend xsk_cq_reserve_locked() to reserve n slots Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 11/14] xsk: support batch xmit main logic Jason Xing
2026-04-15  9:38   ` sashiko-bot
2026-04-16  9:58     ` Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 12/14] xsk: separate read-mostly and write-heavy fields in xsk_buff_pool Jason Xing
2026-04-15  9:20   ` sashiko-bot
2026-04-16 10:09     ` Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 13/14] xsk: retire old xmit path in copy mode Jason Xing
2026-04-15  9:18   ` sashiko-bot
2026-04-16 10:33     ` Jason Xing
2026-04-15  8:26 ` [PATCH RFC net-next v4 14/14] xsk: optimize xsk_build_skb for batch copy-mode fast path Jason Xing
2026-04-15  9:47   ` sashiko-bot
2026-04-16 13:12     ` Jason Xing

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox