[PATCH net v3 0/5] xsk: fix meta and publish of cq issues


* [PATCH net v3 0/5] xsk: fix meta and publish of cq issues
@ 2026-05-17  6:33 Jason Xing
  2026-05-17  6:33 ` [PATCH net v3 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Jason Xing @ 2026-05-17  6:33 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This series is the product of a previous review from sashiko[1].

1) META
patch 1: address TOCTOU around metadata.

2) PUBLISH of CQ
patch 2: make sure xsk_addr->addrs[] can be published to cq when
         overflow occurs.
patch 3: keep cleaning up the continuation descs (more than 17) and
         publish their addresses when overflow occurs.
patch 4: like patch 3, but only handles the invalid-desc case.

[1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/

---
V3
Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
1. avoid breaking previous usage of sendto, and silently handle the
overflow case (Stan, sashiko)
2. add one particular exception process in patch 4 (sashiko)
3. adjust the selftest to make sure it passes on both virtual and
physical machines, which includes adding a usleep to support physical
machines.

V2
Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
1. adjust selftests (Jakub)
2. add READ_ONCE in patch 1 (Stan)

Jason Xing (5):
  xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
  xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  xsk: drain continuation descs after overflow in xsk_build_skb()
  xsk: drain continuation descs on invalid descriptor in
    __xsk_generic_xmit()
  selftests/xsk: drain CQ to wait for TX completion

 include/net/xdp_sock.h                        |  1 +
 net/xdp/xsk.c                                 | 44 +++++++++++++----
 .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
 3 files changed, 63 insertions(+), 30 deletions(-)

-- 
2.43.7



* [PATCH net v3 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
  2026-05-17  6:33 [PATCH net v3 0/5] xsk: fix meta and publish of cq issues Jason Xing
@ 2026-05-17  6:33 ` Jason Xing
  2026-05-18  6:34   ` sashiko-bot
  2026-05-17  6:33 ` [PATCH net v3 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Jason Xing @ 2026-05-17  6:33 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

The TX metadata area resides in the UMEM buffer which is memory-mapped
and concurrently writable by userspace. In xsk_skb_metadata(),
csum_start and csum_offset are read from shared memory for bounds
validation, then read again for skb assignment. A malicious userspace
application can race to overwrite these values between the two reads,
bypassing the bounds check and causing out-of-bounds memory access
during checksum computation in the transmit path.

Fix this by reading csum_start and csum_offset into local variables
once, then using the local copies for both validation and assignment.

Note that other metadata fields (flags, launch_time) and the cached
csum fields may be mutually inconsistent due to concurrent userspace
writes, but this is benign: the only security-critical invariant is
that each field's validated value is the same one used, which local
caching guarantees.
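
As a minimal userspace sketch of the read-once pattern described above
(not the kernel code itself): struct csum_request, READ_ONCE_U16() and
check_csum() are hypothetical stand-ins for struct xsk_tx_metadata,
READ_ONCE() and the check in xsk_skb_metadata(). The point is only that
the value that passes the bounds check is the same value handed back
for use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the csum request fields in shared memory. */
struct csum_request {
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Userspace approximation of a single, non-repeated load. */
#define READ_ONCE_U16(x) (*(const volatile uint16_t *)&(x))

/*
 * Snapshot both fields once, validate the snapshot, and hand the same
 * snapshot back to the caller. Returns 0 on success, -1 if the checksum
 * field would not fit inside a descriptor of desc_len bytes.
 */
static int check_csum(const struct csum_request *req, uint32_t desc_len,
		      uint16_t *start, uint16_t *offset)
{
	uint16_t s = READ_ONCE_U16(req->csum_start);
	uint16_t o = READ_ONCE_U16(req->csum_offset);

	/* sizeof(uint16_t) is the size of the 16-bit checksum field. */
	if ((uint32_t)s + o + sizeof(uint16_t) > desc_len)
		return -1;

	*start = s;
	*offset = o;
	return 0;
}
```

A caller would use only *start and *offset after a successful return
and never go back to the shared structure for these two fields.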

Closes: https://lore.kernel.org/all/20260503200927.73EA1C2BCB4@smtp.kernel.org/
Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5e5786cd9af5..f8c8a8c9dfba 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -802,6 +802,7 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
 			    u32 hr)
 {
 	struct xsk_tx_metadata *meta = NULL;
+	u16 csum_start, csum_offset;
 
 	if (unlikely(pool->tx_metadata_len == 0))
 		return -EINVAL;
@@ -811,13 +812,15 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
 		return -EINVAL;
 
 	if (meta->flags & XDP_TXMD_FLAGS_CHECKSUM) {
-		if (unlikely(meta->request.csum_start +
-			     meta->request.csum_offset +
+		csum_start = READ_ONCE(meta->request.csum_start);
+		csum_offset = READ_ONCE(meta->request.csum_offset);
+
+		if (unlikely(csum_start + csum_offset +
 			     sizeof(__sum16) > desc->len))
 			return -EINVAL;
 
-		skb->csum_start = hr + meta->request.csum_start;
-		skb->csum_offset = meta->request.csum_offset;
+		skb->csum_start = hr + csum_start;
+		skb->csum_offset = csum_offset;
 		skb->ip_summed = CHECKSUM_PARTIAL;
 
 		if (unlikely(pool->tx_sw_csum)) {
-- 
2.43.7



* [PATCH net v3 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  2026-05-17  6:33 [PATCH net v3 0/5] xsk: fix meta and publish of cq issues Jason Xing
  2026-05-17  6:33 ` [PATCH net v3 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
@ 2026-05-17  6:33 ` Jason Xing
  2026-05-21 12:03   ` Maciej Fijalkowski
  2026-05-17  6:33 ` [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Jason Xing @ 2026-05-17  6:33 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This patch is inspired by the check[1] from sashiko, which says that
when overflow happens, the address to be published to the cq is invalid.
Actually, the more severe problem is that the whole process of
publishing the address to the cq in this particular case is not right:
it should truly publish the address and advance the cached_prod in the
cq as long as it reads descriptors from the txq.

The following is the full analysis.
xsk_drop_skb() is called in three places, which all discard a partially
built multi-buffer skb:
1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
   the TX ring prevents the partial packet from completing
3) xsk_release(): socket close while xs->skb holds an incomplete packet

In all three cases, the TX descriptors for the already-processed frags
have been consumed from the TX ring (xskq_cons_release), and CQ slots
have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
the buffer addresses never appear in the completion queue, userspace
permanently loses track of these buffers.

Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
destructor, which already submits buffer addresses to the CQ via
xsk_cq_submit_addr_locked().

Note that cancelling the descriptors back to the TX ring (via
xskq_cons_cancel_n) is not an appropriate option because an oversized
packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
which is obviously a deadlock bug in the TX path.

Also move the desc->addr assignment in xsk_build_skb() above the
overflow check so that the current descriptor's address is recorded
before a potential -EOVERFLOW jump to free_err, consistent with the
zerocopy path in xsk_build_skb_zerocopy().

[1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f8c8a8c9dfba..0a6203c42576 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -793,8 +793,11 @@ static void xsk_consume_skb(struct sk_buff *skb)

 static void xsk_drop_skb(struct sk_buff *skb)
 {
-	xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
-	xsk_consume_skb(skb);
+	struct xdp_sock *xs = xdp_sk(skb->sk);
+
+	xs->tx->invalid_descs += xsk_get_num_desc(skb);
+	consume_skb(skb);
+	xs->skb = NULL;
 }

 static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
@@ -876,7 +879,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 			return ERR_PTR(-ENOMEM);

 		/* in case of -EOVERFLOW that could happen below,
-		 * xsk_consume_skb() will release this node as whole skb
+		 * xsk_drop_skb() will release this node as whole skb
 		 * would be dropped, which implies freeing all list elements
 		 */
 		xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
@@ -968,6 +971,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				goto free_err;
 			}

+			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
+
 			if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
 				err = -EOVERFLOW;
 				goto free_err;
@@ -985,8 +990,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,

 			skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
 			refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
-
-			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
 		}
 	}

-- 
2.43.7


* [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-17  6:33 [PATCH net v3 0/5] xsk: fix meta and publish of cq issues Jason Xing
  2026-05-17  6:33 ` [PATCH net v3 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
  2026-05-17  6:33 ` [PATCH net v3 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
@ 2026-05-17  6:33 ` Jason Xing
  2026-05-18  6:34   ` sashiko-bot
  2026-05-19 21:19   ` Stanislav Fomichev
  2026-05-17  6:33 ` [PATCH net v3 4/5] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
  2026-05-17  6:33 ` [PATCH net v3 5/5] selftests/xsk: drain CQ to wait for TX completion Jason Xing
  4 siblings, 2 replies; 15+ messages in thread
From: Jason Xing @ 2026-05-17  6:33 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
only the current descriptor is released from the TX ring. The remaining
continuation descriptors of the same packet stay in the ring. Since
xs->skb is set to NULL after the drop, the TX loop picks up these
leftover frags and misinterprets each one as the beginning of a new
packet, corrupting the packet stream.

Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised.
The main TX loop in __xsk_generic_xmit() then handles continuation
descriptors one at a time: each gets a normal CQ reservation (with
backpressure), its address is submitted to the completion queue, and
the descriptor is released from the TX ring. When the last fragment
(without XDP_PKT_CONTD) is processed, the flag is cleared and the
function returns -EOVERFLOW so the next call starts with a fresh
budget for normal packets. This behavior roughly follows how the xmit
path previously treated overflow packets: stop sending packets once a
problematic desc is detected. Here, transmission stops only when the
whole group of descs belonging to the same skb has been completed.

This reuses the existing CQ backpressure and budget mechanisms, so if
the CQ is full the function returns -EAGAIN and userspace drains the
CQ before retrying. Zero buffer leakage, zero packet stream corruption.
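
The flag handling described above can be sketched in userspace as
follows. This is a simplified model under stated assumptions, not the
kernel implementation: struct desc, PKT_CONTD and drain_continuation()
are hypothetical names, and CQ backpressure, the TX budget and the
return code are deliberately left out. It only shows that draining
stops right after the first descriptor without the continuation bit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PKT_CONTD 0x1u	/* mirrors XDP_PKT_CONTD (bit 0) */

/* Hypothetical, simplified view of one TX descriptor. */
struct desc {
	uint64_t addr;
	uint32_t options;
};

/*
 * Drain leftover continuation descriptors starting at ring[pos].
 * Each drained address is appended to cq[], modelling its publication
 * to the completion queue. Draining stops after the first descriptor
 * without PKT_CONTD, which is the last fragment of the dropped packet.
 * Returns the index of the first descriptor that was not drained.
 */
static size_t drain_continuation(const struct desc *ring, size_t n, size_t pos,
				 bool *drain_cont, uint64_t *cq, size_t *cq_len)
{
	while (*drain_cont && pos < n) {
		cq[(*cq_len)++] = ring[pos].addr;
		if (!(ring[pos].options & PKT_CONTD))
			*drain_cont = false;
		pos++;
	}
	return pos;
}
```

The caller must size cq[] for at least n entries; the descriptor at the
returned index is then the start of the next packet.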

Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xdp_sock.h |  1 +
 net/xdp/xsk.c          | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ebac60a3d8a1..8b51876efbed 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -80,6 +80,7 @@ struct xdp_sock {
 	 * call of __xsk_generic_xmit().
 	 */
 	struct sk_buff *skb;
+	bool drain_cont;
 
 	struct list_head map_list;
 	/* Protects map_list */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0a6203c42576..298194b7335e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1016,6 +1016,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			xs->tx->invalid_descs++;
 		}
 		xskq_cons_release(xs->tx);
+		if (xp_mb_desc(desc))
+			xs->drain_cont = true;
 	} else {
 		/* Let application retry */
 		xsk_cq_cancel_locked(xs->pool, 1);
@@ -1062,6 +1064,23 @@ static int __xsk_generic_xmit(struct sock *sk)
 			goto out;
 		}
 
+		if (unlikely(xs->drain_cont)) {
+			unsigned long flags;
+			u32 idx;
+
+			spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
+			idx = xskq_get_prod(xs->pool->cq);
+			xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
+			xskq_prod_submit_n(xs->pool->cq, 1);
+			spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
+
+			xs->tx->invalid_descs++;
+			xskq_cons_release(xs->tx);
+			if (!xp_mb_desc(&desc))
+				xs->drain_cont = false;
+			continue;
+		}
+
 		skb = xsk_build_skb(xs, &desc);
 		if (IS_ERR(skb)) {
 			err = PTR_ERR(skb);
-- 
2.43.7



* [PATCH net v3 4/5] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit()
  2026-05-17  6:33 [PATCH net v3 0/5] xsk: fix meta and publish of cq issues Jason Xing
                   ` (2 preceding siblings ...)
  2026-05-17  6:33 ` [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
@ 2026-05-17  6:33 ` Jason Xing
  2026-05-17  6:33 ` [PATCH net v3 5/5] selftests/xsk: drain CQ to wait for TX completion Jason Xing
  4 siblings, 0 replies; 15+ messages in thread
From: Jason Xing @ 2026-05-17  6:33 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When the TX loop in __xsk_generic_xmit() encounters an invalid
descriptor mid-packet (e.g. an out-of-bounds address), the partial
skb is dropped and the offending descriptor is released. However,
remaining continuation descriptors belonging to the same multi-buffer
packet still sit in the TX ring. Since xs->skb becomes NULL after the
drop, the next iteration treats the leftover continuation fragment as
a brand-new packet, corrupting the packet stream.

Fix this by setting the drain_cont flag when the released descriptor
has XDP_PKT_CONTD set. On the next call to __xsk_generic_xmit(), the
drain logic introduced in the previous patch handles the remaining
fragments with normal CQ backpressure.

There is one subtle case: if a subsequent continuation descriptor also
has an invalid address, xskq_cons_peek_desc() rejects it and the
while loop is never entered, so the in-loop drain path cannot clear
drain_cont. The post-loop code already handles this: it sees
xskq_has_descs() is true (the failed descriptor was read but not
released by peek), releases it, and checks its XDP_PKT_CONTD flag.
Add an else branch so that when the released descriptor is the
last fragment (no XDP_PKT_CONTD), drain_cont is cleared. This
prevents the next valid packet from being incorrectly drained.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 298194b7335e..cd451b285645 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1122,6 +1122,7 @@ static int __xsk_generic_xmit(struct sock *sk)
 		if (xs->skb)
 			xsk_drop_skb(xs->skb);
 		xskq_cons_release(xs->tx);
+		xs->drain_cont = xp_mb_desc(&desc);
 	}

 out:
-- 
2.43.7


* [PATCH net v3 5/5] selftests/xsk: drain CQ to wait for TX completion
  2026-05-17  6:33 [PATCH net v3 0/5] xsk: fix meta and publish of cq issues Jason Xing
                   ` (3 preceding siblings ...)
  2026-05-17  6:33 ` [PATCH net v3 4/5] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
@ 2026-05-17  6:33 ` Jason Xing
  2026-05-18  6:34   ` sashiko-bot
  4 siblings, 1 reply; 15+ messages in thread
From: Jason Xing @ 2026-05-17  6:33 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

After the kernel xsk drain_cont patches, dropped multi-buffer
descriptors get their buffer addresses published to the completion
queue (CQ) via the skb destructor instead of being cancelled. As a
result, the CQ entries observed by user space no longer match the
software-side accounting based on valid_frags only:
__send_pkts() bumps xsk->outstanding_tx by valid_frags, while
complete_pkts() decrements it by every CQ entry it consumes,
including those produced by drops/drains. This makes
outstanding_tx underflow and causes wait_for_tx_completion() to
exit while valid descriptors are still sitting in the TX ring,
which in turn makes receive_pkts() time out for the
ALIGNED_INV_DESC_MULTI_BUFF, UNALIGNED_INV_DESC_MULTI_BUFF and
TOO_MANY_FRAGS subtests.

Fix this with two changes to the TX completion path:
- complete_pkts(): tolerate extra CQ completions by clamping
  outstanding_tx to zero instead of failing.
- wait_for_tx_completion(): after the outstanding_tx loop finishes,
  add a drain loop that kicks TX and consumes remaining CQ entries.
  After the drain loop exits, do a short usleep and one final
  complete_pkts() call so that real hardware (e.g. ice) has enough
  time to post late CQ entries before we conclude the ring is
  fully drained.
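
The clamping in the first change can be sketched in isolation as below.
account_completions() is a hypothetical helper used only for
illustration; the selftest performs the same comparison inline on
xsk->outstanding_tx.

```c
#include <stdint.h>

/*
 * Subtract rcvd completions from the outstanding counter, clamping at
 * zero so completions for dropped or drained descriptors cannot make
 * the unsigned counter wrap around.
 */
static uint32_t account_completions(uint32_t outstanding_tx, uint32_t rcvd)
{
	return rcvd > outstanding_tx ? 0 : outstanding_tx - rcvd;
}
```

Without the clamp, an unsigned subtraction with rcvd larger than
outstanding_tx would wrap to a very large value instead of reaching
zero.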

Adjust the multi-buffer invalid-desc tests so that the last
descriptor of every invalid packet has XDP_PKT_CONTD cleared.
Without this, the kernel drain_cont logic would consume
descriptors past the packet boundary and eat into the next valid
packet, breaking pkt_nb validation. Concretely:
- XSK_DESC__INVALID_OPTION is changed from 0xffff to 0xfffe so it
  no longer asserts the XDP_PKT_CONTD bit (bit 0).
- testapp_invalid_desc_mb() clears XDP_PKT_CONTD on the trailing
  descriptor of the invalid-address and invalid-length packets.
- testapp_too_many_frags() appends one extra terminating
  descriptor so the over-sized invalid packet ends with
  XDP_PKT_CONTD cleared, preventing the drain from spilling into
  the trailing sync packet.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 7950c504ed28..1f196c8ebc73 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -31,7 +31,7 @@
 #define POLL_TMOUT			1000
 #define THREAD_TMOUT			3
 #define UMEM_HEADROOM_TEST_SIZE		128
-#define XSK_DESC__INVALID_OPTION	(0xffff)
+#define XSK_DESC__INVALID_OPTION	(0xfffe)
 #define XSK_UMEM__INVALID_FRAME_SIZE	(MAX_ETH_JUMBO_SIZE + 1)
 #define XSK_UMEM__LARGE_FRAME_SIZE	(3 * 1024)
 #define XSK_UMEM__MAX_FRAME_SIZE	(4 * 1024)
@@ -950,17 +950,11 @@ static int complete_pkts(struct xsk_socket_info *xsk, int batch_size)
 
 	rcvd = xsk_ring_cons__peek(&xsk->umem->cq, batch_size, &idx);
 	if (rcvd) {
-		if (rcvd > xsk->outstanding_tx) {
-			u64 addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx + rcvd - 1);
-
-			ksft_print_msg("[%s] Too many packets completed\n", __func__);
-			ksft_print_msg("Last completion address: %llx\n",
-				       (unsigned long long)addr);
-			return TEST_FAILURE;
-		}
-
 		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
-		xsk->outstanding_tx -= rcvd;
+		if (rcvd > xsk->outstanding_tx)
+			xsk->outstanding_tx = 0;
+		else
+			xsk->outstanding_tx -= rcvd;
 	}
 
 	return TEST_PASS;
@@ -1274,6 +1268,8 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 static int wait_for_tx_completion(struct xsk_socket_info *xsk)
 {
 	struct timeval tv_end, tv_now, tv_timeout = {THREAD_TMOUT, 0};
+	unsigned int rcvd;
+	u32 idx;
 	int ret;
 
 	ret = gettimeofday(&tv_now, NULL);
@@ -1293,6 +1289,17 @@ static int wait_for_tx_completion(struct xsk_socket_info *xsk)
 		complete_pkts(xsk, xsk->batch_size);
 	}
 
+	do {
+		if (xsk_ring_prod__needs_wakeup(&xsk->tx))
+			kick_tx(xsk);
+		rcvd = xsk_ring_cons__peek(&xsk->umem->cq, xsk->batch_size, &idx);
+		if (rcvd)
+			xsk_ring_cons__release(&xsk->umem->cq, rcvd);
+	} while (rcvd);
+
+	usleep(100);
+	complete_pkts(xsk, xsk->batch_size);
+
 	return TEST_PASS;
 }
 
@@ -2075,10 +2082,10 @@ int testapp_invalid_desc_mb(struct test_spec *test)
 		{0, 0, 0, false, 0},
 		/* Invalid address in the second frame */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{umem_size, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{umem_size, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, 0},
 		/* Invalid len in the middle */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, 0},
 		/* Invalid options in the middle */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XSK_DESC__INVALID_OPTION},
@@ -2229,7 +2236,7 @@ int testapp_too_many_frags(struct test_spec *test)
 		max_frags += 1;
 	}
 
-	pkts = calloc(2 * max_frags + 2, sizeof(struct pkt));
+	pkts = calloc(2 * max_frags + 3, sizeof(struct pkt));
 	if (!pkts)
 		return TEST_FAILURE;
 
@@ -2247,20 +2254,19 @@ int testapp_too_many_frags(struct test_spec *test)
 	}
 	pkts[max_frags].options = 0;
 
-	/* An invalid packet with the max amount of frags but signals packet
-	 * continues on the last frag
-	 */
-	for (i = max_frags + 1; i < 2 * max_frags + 1; i++) {
+	/* An invalid packet with too many frags */
+	for (i = max_frags + 1; i < 2 * max_frags + 2; i++) {
 		pkts[i].len = MIN_PKT_SIZE;
 		pkts[i].options = XDP_PKT_CONTD;
 		pkts[i].valid = false;
 	}
+	pkts[2 * max_frags + 1].options = 0;
 
 	/* Valid packet for synch */
-	pkts[2 * max_frags + 1].len = MIN_PKT_SIZE;
-	pkts[2 * max_frags + 1].valid = true;
+	pkts[2 * max_frags + 2].len = MIN_PKT_SIZE;
+	pkts[2 * max_frags + 2].valid = true;
 
-	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 2)) {
+	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 3)) {
 		free(pkts);
 		return TEST_FAILURE;
 	}
-- 
2.43.7



* Re: [PATCH net v3 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
  2026-05-17  6:33 ` [PATCH net v3 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
@ 2026-05-18  6:34   ` sashiko-bot
  2026-05-18  9:06     ` Jason Xing
  0 siblings, 1 reply; 15+ messages in thread
From: sashiko-bot @ 2026-05-18  6:34 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Critical] The `skb->csum_start` assignment in `xsk_skb_metadata` can silently wrap around 16 bits on architectures with large page sizes (e.g., 64KB), leading to negative checksum offsets and hardware descriptor corruption.
- [High] Software checksum fallback (`skb_checksum_help`) in `xsk_build_skb_zerocopy()` operates on an empty linear SKB, unconditionally failing and dropping packets.
- [High] Returning `-EINVAL` from `xsk_skb_metadata` permanently wedges the AF_XDP TX queue, causing a local Denial of Service.
--

commit a101fbe76c494016d6edebf7d5d5e0dc173c0af5
Author: Jason Xing <kerneljasonxing@gmail.com>

xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()

The TX metadata area resides in the UMEM buffer which is memory-mapped
and concurrently writable by userspace. In xsk_skb_metadata(),
csum_start and csum_offset are read from shared memory for bounds
validation, then read again for skb assignment. This commit fixes a TOCTOU
race by reading these values into local variables once.

Closes: https://lore.kernel.org/all/20260503200927.73EA1C2BCB4@smtp.kernel.org/
Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 5e5786cd9af55..f8c8a8c9dfba5 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -811,13 +812,15 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
>  		return -EINVAL;
>  
>  	if (meta->flags & XDP_TXMD_FLAGS_CHECKSUM) {
> -		if (unlikely(meta->request.csum_start +
> -			     meta->request.csum_offset +
> +		csum_start = READ_ONCE(meta->request.csum_start);
> +		csum_offset = READ_ONCE(meta->request.csum_offset);
> +
> +		if (unlikely(csum_start + csum_offset +
>  			     sizeof(__sum16) > desc->len))
>  			return -EINVAL;

Does returning -EINVAL here wedge the AF_XDP TX queue?

If xsk_skb_metadata() rejects invalid metadata and returns -EINVAL,
xsk_build_skb() appears to propagate this error without advancing the
consumer pointer via xskq_cons_release(xs->tx).

When __xsk_generic_xmit() receives the -EINVAL error, it bypasses
the post-loop cleanup and the invalid descriptor is never consumed.

Could this put the socket into an infinite loop on subsequent sendmsg() calls?

> -		skb->csum_start = hr + meta->request.csum_start;
> -		skb->csum_offset = meta->request.csum_offset;
> +		skb->csum_start = hr + csum_start;
> +		skb->csum_offset = csum_offset;

Could this addition overflow skb->csum_start on architectures with large page
sizes?

Since desc->len can be up to PAGE_SIZE (e.g., 65536 bytes on ARM64 or
PowerPC), csum_start could be as large as 65534. Adding the driver headroom
(hr) could cause the sum to exceed 65535, truncating when assigned to the
16-bit skb->csum_start field.

When the driver later computes skb_checksum_start_offset(skb), could this
result in a negative integer and cause out-of-bounds offsets to be programmed
into the hardware?
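
To make the arithmetic in this question concrete, here is a small
sketch of the truncation being asked about. store_csum_start() is a
hypothetical helper, and the headroom value used in the note below
(256) is an assumption chosen for illustration, not a value taken from
any driver.

```c
#include <stdint.h>

/*
 * Models storing hr + csum_start into a 16-bit field such as
 * skb->csum_start: anything above 65535 is reduced modulo 65536.
 */
static uint16_t store_csum_start(uint32_t hr, uint16_t csum_start)
{
	return (uint16_t)(hr + csum_start);
}
```

With an assumed hr of 256 and csum_start of 65534, the sum is 65790,
which is stored as 65790 - 65536 = 254. Whether such inputs can
actually reach this assignment is the open question raised here.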

>  		skb->ip_summed = CHECKSUM_PARTIAL;
>  
>  		if (unlikely(pool->tx_sw_csum)) {
>  			int err;
>  
>  			err = skb_checksum_help(skb);
>  			if (err)
>  				return err;
>  		}

Does skb_checksum_help() unconditionally fail here for zero-copy AF_XDP
sockets?

xsk_build_skb_zerocopy() creates a non-linear SKB where data is stored in
fragments, meaning skb_headlen(skb) is 0.

Inside skb_checksum_help(), it checks if the offset is greater than or equal
to skb_headlen(skb). Since skb_headlen() is 0, this check always evaluates
to true, triggering a WARN_ONCE and returning -EINVAL.

Could this cause all checksummed packets to be dropped when hardware offload
is unsupported and pool->tx_sw_csum is true?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260517063311.28921-1-kerneljasonxing@gmail.com?part=1


* Re: [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-17  6:33 ` [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
@ 2026-05-18  6:34   ` sashiko-bot
  2026-05-18  9:09     ` Jason Xing
  2026-05-19 21:19   ` Stanislav Fomichev
  1 sibling, 1 reply; 15+ messages in thread
From: sashiko-bot @ 2026-05-18  6:34 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The commit message claims the function returns `-EOVERFLOW` after draining is complete, but the code merely uses `continue;`.
--

commit 5c0ce4e27ce234e8586a243e61aed44e90c34548
Author: Jason Xing <kerneljasonxing@gmail.com>

xsk: drain continuation descs after overflow in xsk_build_skb()

This patch adds a drain_cont flag to xdp_sock to properly drain leftover
continuation descriptors when a multi-buffer packet exceeds MAX_SKB_FRAGS.
The main TX loop handles these leftover descriptors individually by returning
them to the completion queue with standard backpressure mechanisms.

Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/

> When the last fragment
> (without XDP_PKT_CONTD) is processed, the flag is cleared and the
> function returns -EOVERFLOW so the next call starts with a fresh
> budget for normal packets.

Is this description accurate?

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 0a6203c425766..298194b7335e1 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c

[ ... ]

> @@ -1062,6 +1064,23 @@ static int __xsk_generic_xmit(struct sock *sk)
>  			goto out;
>  		}
>  
> +		if (unlikely(xs->drain_cont)) {
> +			unsigned long flags;
> +			u32 idx;
> +
> +			spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> +			idx = xskq_get_prod(xs->pool->cq);
> +			xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> +			xskq_prod_submit_n(xs->pool->cq, 1);
> +			spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> +
> +			xs->tx->invalid_descs++;
> +			xskq_cons_release(xs->tx);
> +			if (!xp_mb_desc(&desc))
> +				xs->drain_cont = false;
> +			continue;

Looking at the implementation here, the code clears the flag and executes
a continue statement, which proceeds to the next descriptor using the
remaining budget instead of returning -EOVERFLOW.

Could the code and the commit message be out of sync?

> +		}
> +
>  		skb = xsk_build_skb(xs, &desc);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260517063311.28921-1-kerneljasonxing@gmail.com?part=3


* Re: [PATCH net v3 5/5] selftests/xsk: drain CQ to wait for TX completion
  2026-05-17  6:33 ` [PATCH net v3 5/5] selftests/xsk: drain CQ to wait for TX completion Jason Xing
@ 2026-05-18  6:34   ` sashiko-bot
  2026-05-18  9:19     ` Jason Xing
  0 siblings, 1 reply; 15+ messages in thread
From: sashiko-bot @ 2026-05-18  6:34 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The racy drain loop in `wait_for_tx_completion()` can exit prematurely and fail to drain pending CQ entries, leading to test flakiness.
--

commit 25f4d14929e097cf7028dd26ec4e4bae16aa1a5b
Author: Jason Xing <kerneljasonxing@gmail.com>

selftests/xsk: drain CQ to wait for TX completion

This patch addresses test timeouts in xsk multi-buffer invalid-desc tests
caused by dropped descriptors publishing their buffer addresses to the
completion queue. It clamps outstanding_tx to zero to handle extra CQ
completions and adds a drain loop to consume remaining entries.

> diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> index 7950c504ed289..1f196c8ebc738 100644
> --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c

[ ... ]

> @@ -1293,6 +1289,17 @@ static int wait_for_tx_completion(struct xsk_socket_info *xsk)
>  		complete_pkts(xsk, xsk->batch_size);
>  	}
>  
> +	do {
> +		if (xsk_ring_prod__needs_wakeup(&xsk->tx))
> +			kick_tx(xsk);
> +		rcvd = xsk_ring_cons__peek(&xsk->umem->cq, xsk->batch_size, &idx);
> +		if (rcvd)
> +			xsk_ring_cons__release(&xsk->umem->cq, rcvd);
> +	} while (rcvd);

Could this drain loop exit prematurely? If the hardware or kernel is slightly
delayed in posting the remaining completions, xsk_ring_cons__peek() will
return 0 on the first iteration, causing the loop to terminate immediately
without waiting.

> +
> +	usleep(100);
> +	complete_pkts(xsk, xsk->batch_size);

Relying on a hardcoded usleep(100) and a single complete_pkts() call might
not be sufficient if the hardware takes longer than 100us or if the number
of remaining completions exceeds xsk->batch_size.

Would it be more robust to track the total number of submitted descriptors
(including invalid ones) in outstanding_tx instead of using a timeout-based
drain?

> +
>  	return TEST_PASS;
>  }
>

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260517063311.28921-1-kerneljasonxing@gmail.com?part=5

* Re: [PATCH net v3 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
  2026-05-18  6:34   ` sashiko-bot
@ 2026-05-18  9:06     ` Jason Xing
  0 siblings, 0 replies; 15+ messages in thread
From: Jason Xing @ 2026-05-18  9:06 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf

On Mon, May 18, 2026 at 2:34 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [Critical] The `skb->csum_start` assignment in `xsk_skb_metadata` can silently wrap around 16 bits on architectures with large page sizes (e.g., 64KB), leading to negative checksum offsets and hardware descriptor corruption.
> - [High] Software checksum fallback (`skb_checksum_help`) in `xsk_build_skb_zerocopy()` operates on an empty linear SKB, unconditionally failing and dropping packets.
> - [High] Returning `-EINVAL` from `xsk_skb_metadata` permanently wedges the AF_XDP TX queue, causing a local Denial of Service.
> --
>
> commit a101fbe76c494016d6edebf7d5d5e0dc173c0af5
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
>
> The TX metadata area resides in the UMEM buffer which is memory-mapped
> and concurrently writable by userspace. In xsk_skb_metadata(),
> csum_start and csum_offset are read from shared memory for bounds
> validation, then read again for skb assignment. This commit fixes a TOCTOU
> race by reading these values into local variables once.
>
> Closes: https://lore.kernel.org/all/20260503200927.73EA1C2BCB4@smtp.kernel.org/
> Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 5e5786cd9af55..f8c8a8c9dfba5 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -811,13 +812,15 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> >               return -EINVAL;
> >
> >       if (meta->flags & XDP_TXMD_FLAGS_CHECKSUM) {
> > -             if (unlikely(meta->request.csum_start +
> > -                          meta->request.csum_offset +
> > +             csum_start = READ_ONCE(meta->request.csum_start);
> > +             csum_offset = READ_ONCE(meta->request.csum_offset);
> > +
> > +             if (unlikely(csum_start + csum_offset +
> >                            sizeof(__sum16) > desc->len))
> >                       return -EINVAL;
>
> Does returning -EINVAL here wedge the AF_XDP TX queue?
>
> If xsk_skb_metadata() rejects invalid metadata and returns -EINVAL,
> xsk_build_skb() appears to propagate this error without advancing the
> consumer pointer via xskq_cons_release(xs->tx).
>
> When __xsk_generic_xmit() receives the -EINVAL error, it bypasses
> the post-loop cleanup and the invalid descriptor is never consumed.
>
> Could this put the socket into an infinite loop on subsequent sendmsg() calls?

To be clear, the first two bugs are valid but pre-existing, so no
action is needed for them in this series.

Yes, it will cause xsk to get stuck in the send syscall indefinitely. I
think -EINVAL should be handled the same way as -EOVERFLOW:
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5e5786cd9af5..c6cf2434a7c9 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1010,6 +1010,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
                        xs->tx->invalid_descs++;
                }
                xskq_cons_release(xs->tx);
+       } else if (err == -EINVAL) {
+               xsk_cq_cancel_locked(xs->pool, 1);
+               xs->tx->invalid_descs++;
+               xskq_cons_release(xs->tx);
        } else {
                /* Let application retry */
                xsk_cq_cancel_locked(xs->pool, 1);
@@ -1059,7 +1063,7 @@ static int __xsk_generic_xmit(struct sock *sk)
                skb = xsk_build_skb(xs, &desc);
                if (IS_ERR(skb)) {
                        err = PTR_ERR(skb);
-                       if (err != -EOVERFLOW)
+                       if (err != -EOVERFLOW && err != -EINVAL)
                                goto out;
                        err = 0;
                        continue;
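
For readers following along, the wedge can be reproduced in a small
userspace model. This is only a sketch, not kernel code: struct
toy_ring, toy_xmit() and the skip_invalid switch are hypothetical names
invented for the illustration. It shows that returning an error without
releasing the bad descriptor makes every later pass fail at the same
slot, while releasing it lets the loop make progress.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy TX ring: valid[i] says whether descriptor i passes the metadata checks. */
struct toy_ring {
	const bool *valid;
	size_t len;
	size_t cons;                 /* consumer index into the ring */
	unsigned int invalid_descs;  /* descriptors skipped as invalid */
	unsigned int sent;           /* descriptors transmitted */
};

/*
 * One "sendmsg" pass over the ring.
 * skip_invalid == false: an invalid descriptor aborts the pass and is NOT
 * released, so the next pass hits the same descriptor again.
 * skip_invalid == true: the descriptor is counted, released, and the loop
 * continues with the next one.
 */
static int toy_xmit(struct toy_ring *r, bool skip_invalid)
{
	assert(r->cons <= r->len);

	while (r->cons < r->len) {
		if (!r->valid[r->cons]) {
			if (!skip_invalid)
				return -EINVAL;  /* consumer index left untouched */
			r->invalid_descs++;
			r->cons++;               /* release the bad descriptor */
			continue;
		}
		r->sent++;
		r->cons++;
	}
	return 0;
}
```

Calling toy_xmit(&ring, false) repeatedly on a ring that contains one
invalid descriptor keeps returning -EINVAL with cons stuck at that slot;
with skip_invalid set to true the same ring is fully consumed.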

>
> > -             skb->csum_start = hr + meta->request.csum_start;
> > -             skb->csum_offset = meta->request.csum_offset;
> > +             skb->csum_start = hr + csum_start;
> > +             skb->csum_offset = csum_offset;
>
> Could this addition overflow skb->csum_start on architectures with large page
> sizes?
>
> Since desc->len can be up to PAGE_SIZE (e.g., 65536 bytes on ARM64 or
> PowerPC), csum_start could be as large as 65534. Adding the driver headroom
> (hr) could cause the sum to exceed 65535, truncating when assigned to the
> 16-bit skb->csum_start field.
>
> When the driver later computes skb_checksum_start_offset(skb), could this
> result in a negative integer and cause out-of-bounds offsets to be programmed
> into the hardware?

Ack, it should be checked in advance like this:
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5e5786cd9af5..1fd7a70e5d38 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -816,6 +816,9 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
                             sizeof(__sum16) > desc->len))
                        return -EINVAL;

+               if (unlikely((u32)hr + csum_start > U16_MAX))
+                       return -EINVAL;
+
                skb->csum_start = hr + meta->request.csum_start;
                skb->csum_offset = meta->request.csum_offset;
                skb->ip_summed = CHECKSUM_PARTIAL;
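
The truncation being guarded against can be shown with plain integer
arithmetic. The following is a userspace sketch under assumptions: the
helper names are hypothetical, hr is modelled as a 32-bit headroom and
csum_start as a 16-bit request field, and the 16-bit destination stands
in for skb->csum_start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if headroom + csum_start can be stored in a 16-bit field unchanged. */
static bool csum_start_fits_u16(uint32_t hr, uint16_t csum_start)
{
	return (uint64_t)hr + csum_start <= UINT16_MAX;
}

/* What an unchecked store into a 16-bit field would keep (wraps modulo 65536). */
static uint16_t truncated_csum_start(uint32_t hr, uint16_t csum_start)
{
	return (uint16_t)(hr + csum_start);
}

/* Store only after the range check has passed, mirroring the proposed guard. */
static uint16_t checked_csum_start(uint32_t hr, uint16_t csum_start)
{
	assert(csum_start_fits_u16(hr, csum_start));
	return (uint16_t)(hr + csum_start);
}
```

For example, with hr = 256 and csum_start = 65534 the sum is 65790,
which an unchecked 16-bit store would silently turn into 254; the
range check rejects that input before the store happens.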

>
> >               skb->ip_summed = CHECKSUM_PARTIAL;
> >
> >               if (unlikely(pool->tx_sw_csum)) {
> >                       int err;
> >
> >                       err = skb_checksum_help(skb);
> >                       if (err)
> >                               return err;
> >               }
>
> Does skb_checksum_help() unconditionally fail here for zero-copy AF_XDP
> sockets?
>
> xsk_build_skb_zerocopy() creates a non-linear SKB where data is stored in
> fragments, meaning skb_headlen(skb) is 0.
>
> Inside skb_checksum_help(), it checks if the offset is greater than or equal
> to skb_headlen(skb). Since skb_headlen() is 0, this check always evaluates
> to true, triggering a WARN_ONCE and returning -EINVAL.
>
> Could this cause all checksummed packets to be dropped when hardware offload
> is unsupported and pool->tx_sw_csum is true?

I don't think we need to move the check earlier; the WARN_ONCE alone is enough...

Thanks,
Jason

* Re: [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-18  6:34   ` sashiko-bot
@ 2026-05-18  9:09     ` Jason Xing
  0 siblings, 0 replies; 15+ messages in thread
From: Jason Xing @ 2026-05-18  9:09 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf

On Mon, May 18, 2026 at 2:34 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] The commit message claims the function returns `-EOVERFLOW` after draining is complete, but the code merely uses `continue;`.
> --
>
> commit 5c0ce4e27ce234e8586a243e61aed44e90c34548
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: drain continuation descs after overflow in xsk_build_skb()
>
> This patch adds a drain_cont flag to xdp_sock to properly drain leftover
> continuation descriptors when a multi-buffer packet exceeds MAX_SKB_FRAGS.
> The main TX loop handles these leftover descriptors individually by returning
> them to the completion queue with standard backpressure mechanisms.
>
> Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
>
> > When the last fragment
> > (without XDP_PKT_CONTD) is processed, the flag is cleared and the
> > function returns -EOVERFLOW so the next call starts with a fresh
> > budget for normal packets.
>
> Is this description accurate?
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 0a6203c425766..298194b7335e1 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
>
> [ ... ]
>
> > @@ -1062,6 +1064,23 @@ static int __xsk_generic_xmit(struct sock *sk)
> >                       goto out;
> >               }
> >
> > +             if (unlikely(xs->drain_cont)) {
> > +                     unsigned long flags;
> > +                     u32 idx;
> > +
> > +                     spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > +                     idx = xskq_get_prod(xs->pool->cq);
> > +                     xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > +                     xskq_prod_submit_n(xs->pool->cq, 1);
> > +                     spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> > +
> > +                     xs->tx->invalid_descs++;
> > +                     xskq_cons_release(xs->tx);
> > +                     if (!xp_mb_desc(&desc))
> > +                             xs->drain_cont = false;
> > +                     continue;
>
> Looking at the implementation here, the code clears the flag and executes
> a continue statement, which proceeds to the next descriptor using the
> remaining budget instead of returning -EOVERFLOW.
>
> Could the code and the commit message be out of sync?

Right, will reword the commit log:
"When the last fragment (without XDP_PKT_CONTD) is processed, the flag
is cleared and the loop continues to process subsequent descriptors
with the remaining budget."
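
The reworded behaviour can be modelled as a small state machine. This is
a simplified userspace sketch, not the kernel loop: toy_desc, toy_stats,
toy_tx_loop() and TOY_MAX_FRAGS are hypothetical names, the fragment
limit is shrunk to 2, and CQ backpressure and budget handling are left
out. It only illustrates that after the last leftover fragment is
drained the loop keeps processing later descriptors instead of returning
an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FRAGS 2   /* stand-in for the fragment limit, tiny on purpose */

struct toy_desc {
	bool contd;               /* models XDP_PKT_CONTD: more fragments follow */
};

struct toy_stats {
	unsigned int sent_pkts;
	unsigned int invalid_descs;
	unsigned int cq_entries;
};

static void toy_tx_loop(const struct toy_desc *d, size_t n, struct toy_stats *st)
{
	bool drain_cont = false;
	unsigned int frags = 0;   /* descriptors held by the packet being built */
	size_t i;

	assert(st != NULL);

	for (i = 0; i < n; i++) {
		if (drain_cont) {
			/* Leftover fragment of a dropped packet: complete it. */
			st->cq_entries++;
			st->invalid_descs++;
			if (!d[i].contd)
				drain_cont = false;  /* last fragment seen */
			continue;                    /* no error, keep going */
		}

		frags++;
		if (frags == TOY_MAX_FRAGS && d[i].contd) {
			/* Overflow: drop what was collected, drain the rest. */
			st->cq_entries += frags;
			st->invalid_descs += frags;
			frags = 0;
			drain_cont = true;
			continue;
		}

		if (!d[i].contd) {
			/* Packet complete: one CQ entry per descriptor. */
			st->cq_entries += frags;
			st->sent_pkts++;
			frags = 0;
		}
	}
}
```

Feeding it a four-fragment packet followed by a single-descriptor packet
counts the four fragments as invalid and still sends the trailing packet
in the same pass.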

Thanks,
Jason

>
> > +             }
> > +
> >               skb = xsk_build_skb(xs, &desc);
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260517063311.28921-1-kerneljasonxing@gmail.com?part=3

* Re: [PATCH net v3 5/5] selftests/xsk: drain CQ to wait for TX completion
  2026-05-18  6:34   ` sashiko-bot
@ 2026-05-18  9:19     ` Jason Xing
  0 siblings, 0 replies; 15+ messages in thread
From: Jason Xing @ 2026-05-18  9:19 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf

On Mon, May 18, 2026 at 2:34 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] The racy drain loop in `wait_for_tx_completion()` can exit prematurely and fail to drain pending CQ entries, leading to test flakiness.
> --
>
> commit 25f4d14929e097cf7028dd26ec4e4bae16aa1a5b
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> selftests/xsk: drain CQ to wait for TX completion
>
> This patch addresses test timeouts in xsk multi-buffer invalid-desc tests
> caused by dropped descriptors publishing their buffer addresses to the
> completion queue. It clamps outstanding_tx to zero to handle extra CQ
> completions and adds a drain loop to consume remaining entries.
>
> > diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > index 7950c504ed289..1f196c8ebc738 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
>
> [ ... ]
>
> > @@ -1293,6 +1289,17 @@ static int wait_for_tx_completion(struct xsk_socket_info *xsk)
> >               complete_pkts(xsk, xsk->batch_size);
> >       }
> >
> > +     do {
> > +             if (xsk_ring_prod__needs_wakeup(&xsk->tx))
> > +                     kick_tx(xsk);
> > +             rcvd = xsk_ring_cons__peek(&xsk->umem->cq, xsk->batch_size, &idx);
> > +             if (rcvd)
> > +                     xsk_ring_cons__release(&xsk->umem->cq, rcvd);
> > +     } while (rcvd);
>
> Could this drain loop exit prematurely? If the hardware or kernel is slightly
> delayed in posting the remaining completions, xsk_ring_cons__peek() will
> return 0 on the first iteration, causing the loop to terminate immediately
> without waiting.
>
> > +
> > +     usleep(100);
> > +     complete_pkts(xsk, xsk->batch_size);
>
> Relying on a hardcoded usleep(100) and a single complete_pkts() call might
> not be sufficient if the hardware takes longer than 100us or if the number
> of remaining completions exceeds xsk->batch_size.
>
> Would it be more robust to track the total number of submitted descriptors
> (including invalid ones) in outstanding_tx instead of using a timeout-based
> drain?

No, I tried that before. It caused more xsk selftest failures because
not all descriptors are published to the CQ:
1) address-invalid descriptors are silently dropped by
xskq_cons_peek_desc() via xp_validate_desc() without writing any CQ
entry.
2) only drain_cont descriptors and successfully sent descriptors
produce CQ entries.

Since we cannot predict at submission time which descriptors will hit
the drain_cont path, outstanding_tx cannot precisely track the number
of expected CQ entries. The do-while drain loop covers this gap, so
the current implementation is a deliberately conservative safeguard.
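
The counting argument above can be written down as a toy model. The
names (toy_fate, toy_expected_cq_entries()) are hypothetical and the
three outcomes are only the ones listed in this reply; the point is that
the number of CQ entries depends on each descriptor's outcome, which is
not known when the descriptor is submitted.

```c
#include <assert.h>
#include <stddef.h>

enum toy_fate {
	TOY_SENT,          /* transmitted normally: one CQ entry */
	TOY_DRAIN_CONT,    /* drained as a leftover continuation: one CQ entry */
	TOY_ADDR_INVALID,  /* rejected when peeked from the TX ring: no CQ entry */
};

/* Count the CQ entries a batch would produce, given each descriptor's fate. */
static unsigned int toy_expected_cq_entries(const enum toy_fate *fate, size_t n)
{
	unsigned int entries = 0;
	size_t i;

	assert(fate != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (fate[i] != TOY_ADDR_INVALID)
			entries++;
	return entries;
}
```

A batch of four submitted descriptors with one address-invalid entry
yields three CQ entries, so a counter incremented once per submission
would wait for a completion that never arrives.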

Thanks,
Jason

>
> > +
> >       return TEST_PASS;
> >  }
> >
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260517063311.28921-1-kerneljasonxing@gmail.com?part=5

* Re: [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-17  6:33 ` [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
  2026-05-18  6:34   ` sashiko-bot
@ 2026-05-19 21:19   ` Stanislav Fomichev
  2026-05-19 23:20     ` Jason Xing
  1 sibling, 1 reply; 15+ messages in thread
From: Stanislav Fomichev @ 2026-05-19 21:19 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing

On 05/17, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> only the current descriptor is released from the TX ring. The remaining
> continuation descriptors of the same packet stay in the ring. Since
> xs->skb is set to NULL after the drop, the TX loop picks up these
> leftover frags and misinterprets each one as the beginning of a new
> packet, corrupting the packet stream.
> 
> Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised.
> The main TX loop in __xsk_generic_xmit() then handles continuation
> descriptors one at a time: each gets a normal CQ reservation (with
> backpressure), its address is submitted to the completion queue, and
> the descriptor is released from the TX ring. When the last fragment
> (without XDP_PKT_CONTD) is processed, the flag is cleared and the
> function returns -EOVERFLOW so the next call starts with a fresh
> budget for normal packets. This behavior roughly follows how xmit path
> treats overflow packets previously: stop sending packets when detecting
> the desc has problems. Here, it is stopped only when this group of descs
> from the same skb are completed.
> 
> This reuses the existing CQ backpressure and budget mechanisms, so if
> the CQ is full the function returns -EAGAIN and userspace drains the
> CQ before retrying. Zero buffer leakage, zero packet stream corruption.
> 
> Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  include/net/xdp_sock.h |  1 +
>  net/xdp/xsk.c          | 19 +++++++++++++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index ebac60a3d8a1..8b51876efbed 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -80,6 +80,7 @@ struct xdp_sock {
>  	 * call of __xsk_generic_xmit().
>  	 */
>  	struct sk_buff *skb;
> +	bool drain_cont;
>  
>  	struct list_head map_list;
>  	/* Protects map_list */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 0a6203c42576..298194b7335e 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -1016,6 +1016,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  			xs->tx->invalid_descs++;
>  		}
>  		xskq_cons_release(xs->tx);

[..]

> +		if (xp_mb_desc(desc))
> +			xs->drain_cont = true;

Since you're going to address the sashiko comment anyway, should we also
move this part to __xsk_generic_xmit(), right after err = 0? Having the
set, clear, and check of the flag in the same function feels a bit cleaner.

* Re: [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-19 21:19   ` Stanislav Fomichev
@ 2026-05-19 23:20     ` Jason Xing
  0 siblings, 0 replies; 15+ messages in thread
From: Jason Xing @ 2026-05-19 23:20 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing

On Wed, May 20, 2026 at 5:19 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> On 05/17, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > only the current descriptor is released from the TX ring. The remaining
> > continuation descriptors of the same packet stay in the ring. Since
> > xs->skb is set to NULL after the drop, the TX loop picks up these
> > leftover frags and misinterprets each one as the beginning of a new
> > packet, corrupting the packet stream.
> >
> > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised.
> > The main TX loop in __xsk_generic_xmit() then handles continuation
> > descriptors one at a time: each gets a normal CQ reservation (with
> > backpressure), its address is submitted to the completion queue, and
> > the descriptor is released from the TX ring. When the last fragment
> > (without XDP_PKT_CONTD) is processed, the flag is cleared and the
> > function returns -EOVERFLOW so the next call starts with a fresh
> > budget for normal packets. This behavior roughly follows how xmit path
> > treats overflow packets previously: stop sending packets when detecting
> > the desc has problems. Here, it is stopped only when this group of descs
> > from the same skb are completed.
> >
> > This reuses the existing CQ backpressure and budget mechanisms, so if
> > the CQ is full the function returns -EAGAIN and userspace drains the
> > CQ before retrying. Zero buffer leakage, zero packet stream corruption.
> >
> > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  include/net/xdp_sock.h |  1 +
> >  net/xdp/xsk.c          | 19 +++++++++++++++++++
> >  2 files changed, 20 insertions(+)
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index ebac60a3d8a1..8b51876efbed 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -80,6 +80,7 @@ struct xdp_sock {
> >        * call of __xsk_generic_xmit().
> >        */
> >       struct sk_buff *skb;
> > +     bool drain_cont;
> >
> >       struct list_head map_list;
> >       /* Protects map_list */
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 0a6203c42576..298194b7335e 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -1016,6 +1016,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >                       xs->tx->invalid_descs++;
> >               }
> >               xskq_cons_release(xs->tx);
>
> [..]
>
> > +             if (xp_mb_desc(desc))
> > +                     xs->drain_cont = true;
>
> Since you're going to address the sashiko comment anyway, should we also
> move this part to __xsk_generic_xmit(), right after err = 0? Having the
> set, clear, and check of the flag in the same function feels a bit cleaner.

It makes sense. Thanks. Will do that:)

Thanks,
Jason

* Re: [PATCH net v3 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  2026-05-17  6:33 ` [PATCH net v3 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
@ 2026-05-21 12:03   ` Maciej Fijalkowski
  0 siblings, 0 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2026-05-21 12:03 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Sun, May 17, 2026 at 02:33:08PM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> This patch is inspired by the check[1] from sashiko. It says when
> overflow happens, the address of cq to be published is invalid.
> Actually the more severe problem is that the whole process of
> publishing the address to the cq in this particular case is not right:
> it should truly publish the address and advance the cached_prod in cq
> as long as it reads descriptors from txq.
> 
> The following is the full analysis.
> xsk_drop_skb() is called in three places, which all discard a partially
> built multi-buffer skb:
> 1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
> 2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
>    the TX ring prevents the partial packet from completing
> 3) xsk_release(): socket close while xs->skb holds an incomplete packet
> 
> In all three cases, the TX descriptors for the already-processed frags
> have been consumed from the TX ring (xskq_cons_release), and CQ slots
> have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
> which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
> the buffer addresses never appear in the completion queue, userspace
> permanently loses track of these buffers.
> 
> Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
> destructor, which already submits buffer addresses to the CQ via
> xsk_cq_submit_addr_locked().
> 
> Note that cancelling the descriptors back to the TX ring (via
> xskq_cons_cancel_n) is not an appropriate option because an oversized
> packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
> which is obviously a deadlock bug in the TX path.
> 
> Also move the desc->addr assignment in xsk_build_skb() above the
> overflow check so that the current descriptor's address is recorded
> before a potential -EOVERFLOW jump to free_err, consistent with the
> zerocopy path in xsk_build_skb_zerocopy().
> 
> [1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> 
> Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> ---
>  net/xdp/xsk.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index f8c8a8c9dfba..0a6203c42576 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -793,8 +793,11 @@ static void xsk_consume_skb(struct sk_buff *skb)
>  
>  static void xsk_drop_skb(struct sk_buff *skb)
>  {
> -	xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
> -	xsk_consume_skb(skb);
> +	struct xdp_sock *xs = xdp_sk(skb->sk);
> +
> +	xs->tx->invalid_descs += xsk_get_num_desc(skb);
> +	consume_skb(skb);
> +	xs->skb = NULL;
>  }
>  
>  static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> @@ -876,7 +879,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
>  			return ERR_PTR(-ENOMEM);
>  
>  		/* in case of -EOVERFLOW that could happen below,
> -		 * xsk_consume_skb() will release this node as whole skb
> +		 * xsk_drop_skb() will release this node as whole skb
>  		 * would be dropped, which implies freeing all list elements
>  		 */
>  		xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> @@ -968,6 +971,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  				goto free_err;
>  			}
>  
> +			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> +
>  			if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
>  				err = -EOVERFLOW;
>  				goto free_err;
> @@ -985,8 +990,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  
>  			skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
>  			refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
> -
> -			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
>  		}
>  	}
>  
> -- 
> 2.43.7
> 


Thread overview: 15+ messages
2026-05-17  6:33 [PATCH net v3 0/5] xsk: fix meta and publish of cq issues Jason Xing
2026-05-17  6:33 ` [PATCH net v3 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
2026-05-18  6:34   ` sashiko-bot
2026-05-18  9:06     ` Jason Xing
2026-05-17  6:33 ` [PATCH net v3 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
2026-05-21 12:03   ` Maciej Fijalkowski
2026-05-17  6:33 ` [PATCH net v3 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
2026-05-18  6:34   ` sashiko-bot
2026-05-18  9:09     ` Jason Xing
2026-05-19 21:19   ` Stanislav Fomichev
2026-05-19 23:20     ` Jason Xing
2026-05-17  6:33 ` [PATCH net v3 4/5] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
2026-05-17  6:33 ` [PATCH net v3 5/5] selftests/xsk: drain CQ to wait for TX completion Jason Xing
2026-05-18  6:34   ` sashiko-bot
2026-05-18  9:19     ` Jason Xing
