[PATCH net v4 0/5] xsk: fix meta and publish of cq issues

Netdev List
 help / color / mirror / Atom feed

* [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
@ 2026-05-20  0:42 Jason Xing
  2026-05-20  0:42 ` [PATCH net v4 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Jason Xing @ 2026-05-20  0:42 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

The series is the product of previous review from sashiko[1].

1) META
patch 1: address TOCTOU around metadata.

2) PUBLISH of CQ
patch 2: make sure xsk_addr->addrs[] can be published to cq when
         overflow occurs.
patch 3: keep cleaning up the continuation descs (more than 17) and
         publish its address when overflow occurs.
patch 4: like patch 3, but only handles the invalid descs cases.

[1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/

---
V4
Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
1. correct the description of xmit path in patch 3 (sashiko)
2. move set logic into xmit path in patch 3 (Stan)

V3
Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
1. avoid breaking previous usage of sendto, and siliently handle
overflow case (Stan, sashiko)
2. add one particular exception process in patch 4 (sashiko)
3. adjust the selftest to make sure it passes in either virutal or
physical machines, which includes add usleep to support physical machine.

V2
Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
1. adjust selftests (Jakub)
2. add READ_ONCE in patch 1 (Stan)


Jason Xing (5):
  xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
  xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  xsk: drain continuation descs after overflow in xsk_build_skb()
  xsk: drain continuation descs on invalid descriptor in
    __xsk_generic_xmit()
  selftests/xsk: drain CQ to wait for TX completion

 include/net/xdp_sock.h                        |  1 +
 net/xdp/xsk.c                                 | 44 +++++++++++++----
 .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
 3 files changed, 63 insertions(+), 30 deletions(-)

-- 
2.43.7


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH net v4 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
  2026-05-20  0:42 [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Jason Xing
@ 2026-05-20  0:42 ` Jason Xing
  2026-05-21 12:04   ` Maciej Fijalkowski
  2026-05-20  0:42 ` [PATCH net v4 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-20  0:42 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

The TX metadata area resides in the UMEM buffer which is memory-mapped
and concurrently writable by userspace. In xsk_skb_metadata(),
csum_start and csum_offset are read from shared memory for bounds
validation, then read again for skb assignment. A malicious userspace
application can race to overwrite these values between the two reads,
bypassing the bounds check and causing out-of-bounds memory access
during checksum computation in the transmit path.

Fix this by reading csum_start and csum_offset into local variables
once, then using the local copies for both validation and assignment.

Note that other metadata fields (flags, launch_time) and the cached
csum fields may be mutually inconsistent due to concurrent userspace
writes, but this is benign: the only security-critical invariant is
that each field's validated value is the same one used, which local
caching guarantees.

Closes: https://lore.kernel.org/all/20260503200927.73EA1C2BCB4@smtp.kernel.org/
Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5e5786cd9af5..f8c8a8c9dfba 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -802,6 +802,7 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
 			    u32 hr)
 {
 	struct xsk_tx_metadata *meta = NULL;
+	u16 csum_start, csum_offset;
 
 	if (unlikely(pool->tx_metadata_len == 0))
 		return -EINVAL;
@@ -811,13 +812,15 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
 		return -EINVAL;
 
 	if (meta->flags & XDP_TXMD_FLAGS_CHECKSUM) {
-		if (unlikely(meta->request.csum_start +
-			     meta->request.csum_offset +
+		csum_start = READ_ONCE(meta->request.csum_start);
+		csum_offset = READ_ONCE(meta->request.csum_offset);
+
+		if (unlikely(csum_start + csum_offset +
 			     sizeof(__sum16) > desc->len))
 			return -EINVAL;
 
-		skb->csum_start = hr + meta->request.csum_start;
-		skb->csum_offset = meta->request.csum_offset;
+		skb->csum_start = hr + csum_start;
+		skb->csum_offset = csum_offset;
 		skb->ip_summed = CHECKSUM_PARTIAL;
 
 		if (unlikely(pool->tx_sw_csum)) {
-- 
2.43.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
  2026-05-20  0:42 ` [PATCH net v4 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
@ 2026-05-21 12:04   ` Maciej Fijalkowski
  0 siblings, 0 replies; 25+ messages in thread
From: Maciej Fijalkowski @ 2026-05-21 12:04 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Wed, May 20, 2026 at 08:42:40AM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> The TX metadata area resides in the UMEM buffer which is memory-mapped
> and concurrently writable by userspace. In xsk_skb_metadata(),
> csum_start and csum_offset are read from shared memory for bounds
> validation, then read again for skb assignment. A malicious userspace
> application can race to overwrite these values between the two reads,
> bypassing the bounds check and causing out-of-bounds memory access
> during checksum computation in the transmit path.
> 
> Fix this by reading csum_start and csum_offset into local variables
> once, then using the local copies for both validation and assignment.
> 
> Note that other metadata fields (flags, launch_time) and the cached
> csum fields may be mutually inconsistent due to concurrent userspace
> writes, but this is benign: the only security-critical invariant is
> that each field's validated value is the same one used, which local
> caching guarantees.
> 
> Closes: https://lore.kernel.org/all/20260503200927.73EA1C2BCB4@smtp.kernel.org/
> Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> ---
>  net/xdp/xsk.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 5e5786cd9af5..f8c8a8c9dfba 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -802,6 +802,7 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
>  			    u32 hr)
>  {
>  	struct xsk_tx_metadata *meta = NULL;
> +	u16 csum_start, csum_offset;
>  
>  	if (unlikely(pool->tx_metadata_len == 0))
>  		return -EINVAL;
> @@ -811,13 +812,15 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
>  		return -EINVAL;
>  
>  	if (meta->flags & XDP_TXMD_FLAGS_CHECKSUM) {
> -		if (unlikely(meta->request.csum_start +
> -			     meta->request.csum_offset +
> +		csum_start = READ_ONCE(meta->request.csum_start);
> +		csum_offset = READ_ONCE(meta->request.csum_offset);
> +
> +		if (unlikely(csum_start + csum_offset +
>  			     sizeof(__sum16) > desc->len))
>  			return -EINVAL;
>  
> -		skb->csum_start = hr + meta->request.csum_start;
> -		skb->csum_offset = meta->request.csum_offset;
> +		skb->csum_start = hr + csum_start;
> +		skb->csum_offset = csum_offset;
>  		skb->ip_summed = CHECKSUM_PARTIAL;
>  
>  		if (unlikely(pool->tx_sw_csum)) {
> -- 
> 2.43.7
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH net v4 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  2026-05-20  0:42 [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Jason Xing
  2026-05-20  0:42 ` [PATCH net v4 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
@ 2026-05-20  0:42 ` Jason Xing
  2026-05-21 12:05   ` Maciej Fijalkowski
  2026-05-20  0:42 ` [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-20  0:42 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This patch is inspired by the check[1] from sashiko. It says when
overflow happens, the address of cq to be published is invalid.
Actually the severer thing is the whole process of publishing the
address of cq in this particular case is not right: it should truely
publish the address and advance the cached_prod in cq as long as it
reads descriptors from txq.

The following is the full analysis.
xsk_drop_skb() is called in three places, which all discard a partially
built multi-buffer skb:
1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
   the TX ring prevents the partial packet from completing
3) xsk_release(): socket close while xs->skb holds an incomplete packet

In all three cases, the TX descriptors for the already-processed frags
have been consumed from the TX ring (xskq_cons_release), and CQ slots
have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
the buffer addresses never appear in the completion queue, userspace
permanently loses track of these buffers.

Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
destructor, which already submits buffer addresses to the CQ via
xsk_cq_submit_addr_locked().

Note that cancelling the descriptors back to the TX ring (via
xskq_cons_cancel_n) is not a appropriate option because an oversized
packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
which is an obviously deadlock bug in the TX path.

Also move the desc->addr assignment in xsk_build_skb() above the
overflow check so that the current descriptor's address is recorded
before a potential -EOVERFLOW jump to free_err, consistent with the
zerocopy path in xsk_build_skb_zerocopy().

[1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f8c8a8c9dfba..0a6203c42576 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -793,8 +793,11 @@ static void xsk_consume_skb(struct sk_buff *skb)

 static void xsk_drop_skb(struct sk_buff *skb)
 {
-	xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
-	xsk_consume_skb(skb);
+	struct xdp_sock *xs = xdp_sk(skb->sk);
+
+	xs->tx->invalid_descs += xsk_get_num_desc(skb);
+	consume_skb(skb);
+	xs->skb = NULL;
 }

 static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
@@ -876,7 +879,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 			return ERR_PTR(-ENOMEM);

 		/* in case of -EOVERFLOW that could happen below,
-		 * xsk_consume_skb() will release this node as whole skb
+		 * xsk_drop_skb() will release this node as whole skb
 		 * would be dropped, which implies freeing all list elements
 		 */
 		xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
@@ -968,6 +971,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				goto free_err;
 			}

+			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
+
 			if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
 				err = -EOVERFLOW;
 				goto free_err;
@@ -985,8 +990,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,

 			skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
 			refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
-
-			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
 		}
 	}

-- 
2.43.7

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  2026-05-20  0:42 ` [PATCH net v4 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
@ 2026-05-21 12:05   ` Maciej Fijalkowski
  0 siblings, 0 replies; 25+ messages in thread
From: Maciej Fijalkowski @ 2026-05-21 12:05 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Wed, May 20, 2026 at 08:42:41AM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> This patch is inspired by the check[1] from sashiko. It says when
> overflow happens, the address of cq to be published is invalid.
> Actually the severer thing is the whole process of publishing the
> address of cq in this particular case is not right: it should truely
> publish the address and advance the cached_prod in cq as long as it
> reads descriptors from txq.
> 
> The following is the full analysis.
> xsk_drop_skb() is called in three places, which all discard a partially
> built multi-buffer skb:
> 1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
> 2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
>    the TX ring prevents the partial packet from completing
> 3) xsk_release(): socket close while xs->skb holds an incomplete packet
> 
> In all three cases, the TX descriptors for the already-processed frags
> have been consumed from the TX ring (xskq_cons_release), and CQ slots
> have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
> which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
> the buffer addresses never appear in the completion queue, userspace
> permanently loses track of these buffers.
> 
> Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
> destructor, which already submits buffer addresses to the CQ via
> xsk_cq_submit_addr_locked().
> 
> Note that cancelling the descriptors back to the TX ring (via
> xskq_cons_cancel_n) is not a appropriate option because an oversized
> packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
> which is an obviously deadlock bug in the TX path.
> 
> Also move the desc->addr assignment in xsk_build_skb() above the
> overflow check so that the current descriptor's address is recorded
> before a potential -EOVERFLOW jump to free_err, consistent with the
> zerocopy path in xsk_build_skb_zerocopy().
> 
> [1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> 
> Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>

Sorry for the noise, got lost in my inbox and replied on v3.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> ---
>  net/xdp/xsk.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index f8c8a8c9dfba..0a6203c42576 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -793,8 +793,11 @@ static void xsk_consume_skb(struct sk_buff *skb)
>  
>  static void xsk_drop_skb(struct sk_buff *skb)
>  {
> -	xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
> -	xsk_consume_skb(skb);
> +	struct xdp_sock *xs = xdp_sk(skb->sk);
> +
> +	xs->tx->invalid_descs += xsk_get_num_desc(skb);
> +	consume_skb(skb);
> +	xs->skb = NULL;
>  }
>  
>  static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> @@ -876,7 +879,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
>  			return ERR_PTR(-ENOMEM);
>  
>  		/* in case of -EOVERFLOW that could happen below,
> -		 * xsk_consume_skb() will release this node as whole skb
> +		 * xsk_drop_skb() will release this node as whole skb
>  		 * would be dropped, which implies freeing all list elements
>  		 */
>  		xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> @@ -968,6 +971,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  				goto free_err;
>  			}
>  
> +			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> +
>  			if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
>  				err = -EOVERFLOW;
>  				goto free_err;
> @@ -985,8 +990,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  
>  			skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
>  			refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
> -
> -			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
>  		}
>  	}
>  
> -- 
> 2.43.7
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-20  0:42 [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Jason Xing
  2026-05-20  0:42 ` [PATCH net v4 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
  2026-05-20  0:42 ` [PATCH net v4 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
@ 2026-05-20  0:42 ` Jason Xing
  2026-05-20 16:10   ` Maciej Fijalkowski
  2026-05-20  0:42 ` [PATCH net v4 4/5] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-20  0:42 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
only the current descriptor is released from the TX ring. The remaining
continuation descriptors of the same packet stay in the ring. Since
xs->skb is set to NULL after the drop, the TX loop picks up these
leftover frags and misinterprets each one as the beginning of a new
packet, corrupting the packet stream.

Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised,
so we have a chance to examine and handle the potential remaining descs
of this big overflow'ed skb.

When the last fragment (without XDP_PKT_CONTD) is processed, the flag
is cleared and the loop continues to process subsequent descriptors
with the remaining budget. This behavior follows how previous xmit path
treats overflow packets.

Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xdp_sock.h |  1 +
 net/xdp/xsk.c          | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ebac60a3d8a1..8b51876efbed 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -80,6 +80,7 @@ struct xdp_sock {
 	 * call of __xsk_generic_xmit().
 	 */
 	struct sk_buff *skb;
+	bool drain_cont;
 
 	struct list_head map_list;
 	/* Protects map_list */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0a6203c42576..f4add7be8c93 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1062,11 +1062,30 @@ static int __xsk_generic_xmit(struct sock *sk)
 			goto out;
 		}
 
+		if (unlikely(xs->drain_cont)) {
+			unsigned long flags;
+			u32 idx;
+
+			spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
+			idx = xskq_get_prod(xs->pool->cq);
+			xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
+			xskq_prod_submit_n(xs->pool->cq, 1);
+			spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
+
+			xs->tx->invalid_descs++;
+			xskq_cons_release(xs->tx);
+			if (!xp_mb_desc(&desc))
+				xs->drain_cont = false;
+			continue;
+		}
+
 		skb = xsk_build_skb(xs, &desc);
 		if (IS_ERR(skb)) {
 			err = PTR_ERR(skb);
 			if (err != -EOVERFLOW)
 				goto out;
+			if (xp_mb_desc(&desc))
+				xs->drain_cont = true;
 			err = 0;
 			continue;
 		}
-- 
2.43.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-20  0:42 ` [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
@ 2026-05-20 16:10   ` Maciej Fijalkowski
  2026-05-20 23:53     ` Jason Xing
  0 siblings, 1 reply; 25+ messages in thread
From: Maciej Fijalkowski @ 2026-05-20 16:10 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Wed, May 20, 2026 at 08:42:42AM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> only the current descriptor is released from the TX ring. The remaining
> continuation descriptors of the same packet stay in the ring. Since
> xs->skb is set to NULL after the drop, the TX loop picks up these
> leftover frags and misinterprets each one as the beginning of a new
> packet, corrupting the packet stream.
> 
> Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised,
> so we have a chance to examine and handle the potential remaining descs
> of this big overflow'ed skb.
> 
> When the last fragment (without XDP_PKT_CONTD) is processed, the flag
> is cleared and the loop continues to process subsequent descriptors
> with the remaining budget. This behavior follows how previous xmit path
> treats overflow packets.
> 
> Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  include/net/xdp_sock.h |  1 +
>  net/xdp/xsk.c          | 19 +++++++++++++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index ebac60a3d8a1..8b51876efbed 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -80,6 +80,7 @@ struct xdp_sock {
>  	 * call of __xsk_generic_xmit().
>  	 */
>  	struct sk_buff *skb;
> +	bool drain_cont;
>  
>  	struct list_head map_list;
>  	/* Protects map_list */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 0a6203c42576..f4add7be8c93 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -1062,11 +1062,30 @@ static int __xsk_generic_xmit(struct sock *sk)
>  			goto out;
>  		}
>  
> +		if (unlikely(xs->drain_cont)) {

Hi Jason,

could we get away from introducing dedicated boolean to xdp_sock for
handling this rare case?

> +			unsigned long flags;
> +			u32 idx;
> +
> +			spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> +			idx = xskq_get_prod(xs->pool->cq);
> +			xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> +			xskq_prod_submit_n(xs->pool->cq, 1);
> +			spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);

It feels a bit off to do this cleanup one-by-one.

I wonder if it could be covered by xsk_drop_skb() where you would walk the
tx ring and store addrs at xsk_addrs and then use xsk_cq_submit_addr_locked() ?

OTOH, in theory the amount of frags after we hit the overflow case could
be well over MAX_SKB_FRAGS again so xsk_addrs wouldn't be able to cover
them.

> +
> +			xs->tx->invalid_descs++;
> +			xskq_cons_release(xs->tx);
> +			if (!xp_mb_desc(&desc))
> +				xs->drain_cont = false;
> +			continue;
> +		}
> +
>  		skb = xsk_build_skb(xs, &desc);
>  		if (IS_ERR(skb)) {
>  			err = PTR_ERR(skb);
>  			if (err != -EOVERFLOW)
>  				goto out;
> +			if (xp_mb_desc(&desc))
> +				xs->drain_cont = true;
>  			err = 0;
>  			continue;
>  		}
> -- 
> 2.43.7
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-20 16:10   ` Maciej Fijalkowski
@ 2026-05-20 23:53     ` Jason Xing
  2026-05-21 12:02       ` Maciej Fijalkowski
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-20 23:53 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Thu, May 21, 2026 at 12:10 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, May 20, 2026 at 08:42:42AM +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > only the current descriptor is released from the TX ring. The remaining
> > continuation descriptors of the same packet stay in the ring. Since
> > xs->skb is set to NULL after the drop, the TX loop picks up these
> > leftover frags and misinterprets each one as the beginning of a new
> > packet, corrupting the packet stream.
> >
> > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised,
> > so we have a chance to examine and handle the potential remaining descs
> > of this big overflow'ed skb.
> >
> > When the last fragment (without XDP_PKT_CONTD) is processed, the flag
> > is cleared and the loop continues to process subsequent descriptors
> > with the remaining budget. This behavior follows how previous xmit path
> > treats overflow packets.
> >
> > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  include/net/xdp_sock.h |  1 +
> >  net/xdp/xsk.c          | 19 +++++++++++++++++++
> >  2 files changed, 20 insertions(+)
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index ebac60a3d8a1..8b51876efbed 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -80,6 +80,7 @@ struct xdp_sock {
> >        * call of __xsk_generic_xmit().
> >        */
> >       struct sk_buff *skb;
> > +     bool drain_cont;
> >
> >       struct list_head map_list;
> >       /* Protects map_list */
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 0a6203c42576..f4add7be8c93 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -1062,11 +1062,30 @@ static int __xsk_generic_xmit(struct sock *sk)
> >                       goto out;
> >               }
> >
> > +             if (unlikely(xs->drain_cont)) {
>
> Hi Jason,
>
> could we get away from introducing dedicated boolean to xdp_sock for
> handling this rare case?

I've tried several ways, but they didn't work. Please see the short
conversation with Stan:
https://lore.kernel.org/all/CAL+tcoA6Cu3P-y2wYjJ_R9cSeRkW2tswY1LveqH=uxd2McuR-w@mail.gmail.com/

That means we always rely on a flag (that is either static or like
this patch) to recognize how we treat this abnormal descs.

>
> > +                     unsigned long flags;
> > +                     u32 idx;
> > +
> > +                     spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > +                     idx = xskq_get_prod(xs->pool->cq);
> > +                     xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > +                     xskq_prod_submit_n(xs->pool->cq, 1);
> > +                     spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
>
> It feels a bit off to do this cleanup one-by-one.

Actually no, it adheres to the basic send logic in copy mode.

>
> I wonder if it could be covered by xsk_drop_skb() where you would walk the
> tx ring and store addrs at xsk_addrs and then use xsk_cq_submit_addr_locked() ?
>
> OTOH, in theory the amount of frags after we hit the overflow case could
> be well over MAX_SKB_FRAGS again so xsk_addrs wouldn't be able to cover
> them.

Right, this is why we now put the logic into the xmit path just like
other kinds of descriptors. We need to check and send it under a
series of protections like the check of cq pressure and budget. I
don't think it's an exception to escape the check logic. The whole
process should be fast as there is no need to build skb and send it to
the driver as long as this case is recognized.

Thanks,
Jason

>
> > +
> > +                     xs->tx->invalid_descs++;
> > +                     xskq_cons_release(xs->tx);
> > +                     if (!xp_mb_desc(&desc))
> > +                             xs->drain_cont = false;
> > +                     continue;
> > +             }
> > +
> >               skb = xsk_build_skb(xs, &desc);
> >               if (IS_ERR(skb)) {
> >                       err = PTR_ERR(skb);
> >                       if (err != -EOVERFLOW)
> >                               goto out;
> > +                     if (xp_mb_desc(&desc))
> > +                             xs->drain_cont = true;
> >                       err = 0;
> >                       continue;
> >               }
> > --
> > 2.43.7
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-20 23:53     ` Jason Xing
@ 2026-05-21 12:02       ` Maciej Fijalkowski
  2026-05-21 13:10         ` Jason Xing
  0 siblings, 1 reply; 25+ messages in thread
From: Maciej Fijalkowski @ 2026-05-21 12:02 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Thu, May 21, 2026 at 07:53:27AM +0800, Jason Xing wrote:
> On Thu, May 21, 2026 at 12:10 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Wed, May 20, 2026 at 08:42:42AM +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > > only the current descriptor is released from the TX ring. The remaining
> > > continuation descriptors of the same packet stay in the ring. Since
> > > xs->skb is set to NULL after the drop, the TX loop picks up these
> > > leftover frags and misinterprets each one as the beginning of a new
> > > packet, corrupting the packet stream.
> > >
> > > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised,
> > > so we have a chance to examine and handle the potential remaining descs
> > > of this big overflow'ed skb.
> > >
> > > When the last fragment (without XDP_PKT_CONTD) is processed, the flag
> > > is cleared and the loop continues to process subsequent descriptors
> > > with the remaining budget. This behavior follows how previous xmit path
> > > treats overflow packets.
> > >
> > > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > >  include/net/xdp_sock.h |  1 +
> > >  net/xdp/xsk.c          | 19 +++++++++++++++++++
> > >  2 files changed, 20 insertions(+)
> > >
> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index ebac60a3d8a1..8b51876efbed 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -80,6 +80,7 @@ struct xdp_sock {
> > >        * call of __xsk_generic_xmit().
> > >        */
> > >       struct sk_buff *skb;
> > > +     bool drain_cont;
> > >
> > >       struct list_head map_list;
> > >       /* Protects map_list */
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 0a6203c42576..f4add7be8c93 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -1062,11 +1062,30 @@ static int __xsk_generic_xmit(struct sock *sk)
> > >                       goto out;
> > >               }
> > >
> > > +             if (unlikely(xs->drain_cont)) {
> >
> > Hi Jason,
> >
> > could we get away from introducing dedicated boolean to xdp_sock for
> > handling this rare case?
> 
> I've tried several ways, but they didn't work. Please see the short
> conversation with Stan:
> https://lore.kernel.org/all/CAL+tcoA6Cu3P-y2wYjJ_R9cSeRkW2tswY1LveqH=uxd2McuR-w@mail.gmail.com/
> 
> That means we always rely on a flag (that is either static or like
> this patch) to recognize how we treat this abnormal descs.
> 
> >
> > > +                     unsigned long flags;
> > > +                     u32 idx;
> > > +
> > > +                     spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > > +                     idx = xskq_get_prod(xs->pool->cq);
> > > +                     xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > > +                     xskq_prod_submit_n(xs->pool->cq, 1);
> > > +                     spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> >
> > It feels a bit off to do this cleanup one-by-one.
> 
> Actually no, it adheres to the basic send logic in copy mode.

Not really. Here you are producer to cq which we have batched variants for
usage.

But after second thoughts I don't think it would be worth to put any
further effort onto optimizations here. Anyways, even though generic xmit
is a whole bunch of mess, maybe you could pull this out onto a helper
function?

> 
> >
> > I wonder if it could be covered by xsk_drop_skb() where you would walk the
> > tx ring and store addrs at xsk_addrs and then use xsk_cq_submit_addr_locked() ?
> >
> > OTOH, in theory the amount of frags after we hit the overflow case could
> > be well over MAX_SKB_FRAGS again so xsk_addrs wouldn't be able to cover
> > them.
> 
> Right, this is why we now put the logic into the xmit path just like
> other kinds of descriptors. We need to check and send it under a
> series of protections like the check of cq pressure and budget. I
> don't think it's an exception to escape the check logic. The whole
> process should be fast as there is no need to build skb and send it to
> the driver as long as this case is recognized.
> 
> Thanks,
> Jason
> 
> >
> > > +
> > > +                     xs->tx->invalid_descs++;
> > > +                     xskq_cons_release(xs->tx);
> > > +                     if (!xp_mb_desc(&desc))
> > > +                             xs->drain_cont = false;
> > > +                     continue;
> > > +             }
> > > +
> > >               skb = xsk_build_skb(xs, &desc);
> > >               if (IS_ERR(skb)) {
> > >                       err = PTR_ERR(skb);
> > >                       if (err != -EOVERFLOW)
> > >                               goto out;
> > > +                     if (xp_mb_desc(&desc))
> > > +                             xs->drain_cont = true;
> > >                       err = 0;
> > >                       continue;
> > >               }
> > > --
> > > 2.43.7
> > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-21 12:02       ` Maciej Fijalkowski
@ 2026-05-21 13:10         ` Jason Xing
  2026-05-22  9:06           ` Magnus Karlsson
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-21 13:10 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Thu, May 21, 2026 at 8:02 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Thu, May 21, 2026 at 07:53:27AM +0800, Jason Xing wrote:
> > On Thu, May 21, 2026 at 12:10 AM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Wed, May 20, 2026 at 08:42:42AM +0800, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > > > only the current descriptor is released from the TX ring. The remaining
> > > > continuation descriptors of the same packet stay in the ring. Since
> > > > xs->skb is set to NULL after the drop, the TX loop picks up these
> > > > leftover frags and misinterprets each one as the beginning of a new
> > > > packet, corrupting the packet stream.
> > > >
> > > > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > > > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised,
> > > > so we have a chance to examine and handle the potential remaining descs
> > > > of this big overflow'ed skb.
> > > >
> > > > When the last fragment (without XDP_PKT_CONTD) is processed, the flag
> > > > is cleared and the loop continues to process subsequent descriptors
> > > > with the remaining budget. This behavior follows how previous xmit path
> > > > treats overflow packets.
> > > >
> > > > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > >  include/net/xdp_sock.h |  1 +
> > > >  net/xdp/xsk.c          | 19 +++++++++++++++++++
> > > >  2 files changed, 20 insertions(+)
> > > >
> > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > index ebac60a3d8a1..8b51876efbed 100644
> > > > --- a/include/net/xdp_sock.h
> > > > +++ b/include/net/xdp_sock.h
> > > > @@ -80,6 +80,7 @@ struct xdp_sock {
> > > >        * call of __xsk_generic_xmit().
> > > >        */
> > > >       struct sk_buff *skb;
> > > > +     bool drain_cont;
> > > >
> > > >       struct list_head map_list;
> > > >       /* Protects map_list */
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index 0a6203c42576..f4add7be8c93 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -1062,11 +1062,30 @@ static int __xsk_generic_xmit(struct sock *sk)
> > > >                       goto out;
> > > >               }
> > > >
> > > > +             if (unlikely(xs->drain_cont)) {
> > >
> > > Hi Jason,
> > >
> > > could we get away from introducing dedicated boolean to xdp_sock for
> > > handling this rare case?
> >
> > I've tried several ways, but they didn't work. Please see the short
> > conversation with Stan:
> > https://lore.kernel.org/all/CAL+tcoA6Cu3P-y2wYjJ_R9cSeRkW2tswY1LveqH=uxd2McuR-w@mail.gmail.com/
> >
> > That means we always rely on a flag (that is either static or like
> > this patch) to recognize how we treat this abnormal descs.
> >
> > >
> > > > +                     unsigned long flags;
> > > > +                     u32 idx;
> > > > +
> > > > +                     spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > > > +                     idx = xskq_get_prod(xs->pool->cq);
> > > > +                     xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > > > +                     xskq_prod_submit_n(xs->pool->cq, 1);
> > > > +                     spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> > >
> > > It feels a bit off to do this cleanup one-by-one.
> >
> > Actually no, it adheres to the basic send logic in copy mode.
>
> Not really. Here you are producer to cq which we have batched variants for
> usage.
>
> But after second thoughts I don't think it would be worth to put any
> further effort onto optimizations here. Anyways, even though generic xmit
> is a whole bunch of mess, maybe you could pull this out onto a helper
> function?

Sure thing. My takeaway is that it will be refactored once as the
batch xmit feature lands in the tree.

Thanks,
Jason

>
> >
> > >
> > > I wonder if it could be covered by xsk_drop_skb() where you would walk the
> > > tx ring and store addrs at xsk_addrs and then use xsk_cq_submit_addr_locked() ?
> > >
> > > OTOH, in theory the amount of frags after we hit the overflow case could
> > > be well over MAX_SKB_FRAGS again so xsk_addrs wouldn't be able to cover
> > > them.
> >
> > Right, this is why we now put the logic into the xmit path just like
> > other kinds of descriptors. We need to check and send it under a
> > series of protections like the check of cq pressure and budget. I
> > don't think it's an exception to escape the check logic. The whole
> > process should be fast as there is no need to build skb and send it to
> > the driver as long as this case is recognized.
> >
> > Thanks,
> > Jason
> >
> > >
> > > > +
> > > > +                     xs->tx->invalid_descs++;
> > > > +                     xskq_cons_release(xs->tx);
> > > > +                     if (!xp_mb_desc(&desc))
> > > > +                             xs->drain_cont = false;
> > > > +                     continue;
> > > > +             }
> > > > +
> > > >               skb = xsk_build_skb(xs, &desc);
> > > >               if (IS_ERR(skb)) {
> > > >                       err = PTR_ERR(skb);
> > > >                       if (err != -EOVERFLOW)
> > > >                               goto out;
> > > > +                     if (xp_mb_desc(&desc))
> > > > +                             xs->drain_cont = true;
> > > >                       err = 0;
> > > >                       continue;
> > > >               }
> > > > --
> > > > 2.43.7
> > > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-21 13:10         ` Jason Xing
@ 2026-05-22  9:06           ` Magnus Karlsson
  2026-05-22  9:22             ` Jason Xing
  0 siblings, 1 reply; 25+ messages in thread
From: Magnus Karlsson @ 2026-05-22  9:06 UTC (permalink / raw)
  To: Jason Xing
  Cc: Maciej Fijalkowski, davem, edumazet, kuba, pabeni, bjorn,
	magnus.karlsson, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing

On Thu, 21 May 2026 at 15:42, Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, May 21, 2026 at 8:02 PM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Thu, May 21, 2026 at 07:53:27AM +0800, Jason Xing wrote:
> > > On Thu, May 21, 2026 at 12:10 AM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Wed, May 20, 2026 at 08:42:42AM +0800, Jason Xing wrote:
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > > > > only the current descriptor is released from the TX ring. The remaining
> > > > > continuation descriptors of the same packet stay in the ring. Since
> > > > > xs->skb is set to NULL after the drop, the TX loop picks up these
> > > > > leftover frags and misinterprets each one as the beginning of a new
> > > > > packet, corrupting the packet stream.
> > > > >
> > > > > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > > > > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised,
> > > > > so we have a chance to examine and handle the potential remaining descs
> > > > > of this big overflow'ed skb.
> > > > >
> > > > > When the last fragment (without XDP_PKT_CONTD) is processed, the flag
> > > > > is cleared and the loop continues to process subsequent descriptors
> > > > > with the remaining budget. This behavior follows how previous xmit path
> > > > > treats overflow packets.
> > > > >
> > > > > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > > > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > >  include/net/xdp_sock.h |  1 +
> > > > >  net/xdp/xsk.c          | 19 +++++++++++++++++++
> > > > >  2 files changed, 20 insertions(+)
> > > > >
> > > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > > index ebac60a3d8a1..8b51876efbed 100644
> > > > > --- a/include/net/xdp_sock.h
> > > > > +++ b/include/net/xdp_sock.h
> > > > > @@ -80,6 +80,7 @@ struct xdp_sock {
> > > > >        * call of __xsk_generic_xmit().
> > > > >        */
> > > > >       struct sk_buff *skb;
> > > > > +     bool drain_cont;
> > > > >
> > > > >       struct list_head map_list;
> > > > >       /* Protects map_list */
> > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > index 0a6203c42576..f4add7be8c93 100644
> > > > > --- a/net/xdp/xsk.c
> > > > > +++ b/net/xdp/xsk.c
> > > > > @@ -1062,11 +1062,30 @@ static int __xsk_generic_xmit(struct sock *sk)
> > > > >                       goto out;
> > > > >               }
> > > > >
> > > > > +             if (unlikely(xs->drain_cont)) {
> > > >
> > > > Hi Jason,
> > > >
> > > > could we get away from introducing dedicated boolean to xdp_sock for
> > > > handling this rare case?
> > >
> > > I've tried several ways, but they didn't work. Please see the short
> > > conversation with Stan:
> > > https://lore.kernel.org/all/CAL+tcoA6Cu3P-y2wYjJ_R9cSeRkW2tswY1LveqH=uxd2McuR-w@mail.gmail.com/
> > >
> > > That means we always rely on a flag (that is either static or like
> > > this patch) to recognize how we treat this abnormal descs.
> > >
> > > >
> > > > > +                     unsigned long flags;
> > > > > +                     u32 idx;
> > > > > +
> > > > > +                     spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > > > > +                     idx = xskq_get_prod(xs->pool->cq);
> > > > > +                     xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > > > > +                     xskq_prod_submit_n(xs->pool->cq, 1);
> > > > > +                     spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> > > >
> > > > It feels a bit off to do this cleanup one-by-one.
> > >
> > > Actually no, it adheres to the basic send logic in copy mode.
> >
> > Not really. Here you are producer to cq which we have batched variants for
> > usage.
> >
> > But after second thoughts I don't think it would be worth to put any
> > further effort onto optimizations here. Anyways, even though generic xmit
> > is a whole bunch of mess, maybe you could pull this out onto a helper
> > function?
>
> Sure thing. My takeaway is that it will be refactored once as the
> batch xmit feature lands in the tree.

I agree that the code has become really messy since I get a headache
when looking at it ;-). Do you want me to take a stab at simplifying
it?

> Thanks,
> Jason
>
> >
> > >
> > > >
> > > > I wonder if it could be covered by xsk_drop_skb() where you would walk the
> > > > tx ring and store addrs at xsk_addrs and then use xsk_cq_submit_addr_locked() ?
> > > >
> > > > OTOH, in theory the amount of frags after we hit the overflow case could
> > > > be well over MAX_SKB_FRAGS again so xsk_addrs wouldn't be able to cover
> > > > them.
> > >
> > > Right, this is why we now put the logic into the xmit path just like
> > > other kinds of descriptors. We need to check and send it under a
> > > series of protections like the check of cq pressure and budget. I
> > > don't think it's an exception to escape the check logic. The whole
> > > process should be fast as there is no need to build skb and send it to
> > > the driver as long as this case is recognized.
> > >
> > > Thanks,
> > > Jason
> > >
> > > >
> > > > > +
> > > > > +                     xs->tx->invalid_descs++;
> > > > > +                     xskq_cons_release(xs->tx);
> > > > > +                     if (!xp_mb_desc(&desc))
> > > > > +                             xs->drain_cont = false;
> > > > > +                     continue;
> > > > > +             }
> > > > > +
> > > > >               skb = xsk_build_skb(xs, &desc);
> > > > >               if (IS_ERR(skb)) {
> > > > >                       err = PTR_ERR(skb);
> > > > >                       if (err != -EOVERFLOW)
> > > > >                               goto out;
> > > > > +                     if (xp_mb_desc(&desc))
> > > > > +                             xs->drain_cont = true;
> > > > >                       err = 0;
> > > > >                       continue;
> > > > >               }
> > > > > --
> > > > > 2.43.7
> > > > >
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-05-22  9:06           ` Magnus Karlsson
@ 2026-05-22  9:22             ` Jason Xing
  0 siblings, 0 replies; 25+ messages in thread
From: Jason Xing @ 2026-05-22  9:22 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Maciej Fijalkowski, davem, edumazet, kuba, pabeni, bjorn,
	magnus.karlsson, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing

On Fri, May 22, 2026 at 5:06 PM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Thu, 21 May 2026 at 15:42, Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, May 21, 2026 at 8:02 PM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Thu, May 21, 2026 at 07:53:27AM +0800, Jason Xing wrote:
> > > > On Thu, May 21, 2026 at 12:10 AM Maciej Fijalkowski
> > > > <maciej.fijalkowski@intel.com> wrote:
> > > > >
> > > > > On Wed, May 20, 2026 at 08:42:42AM +0800, Jason Xing wrote:
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > > > > > only the current descriptor is released from the TX ring. The remaining
> > > > > > continuation descriptors of the same packet stay in the ring. Since
> > > > > > xs->skb is set to NULL after the drop, the TX loop picks up these
> > > > > > leftover frags and misinterprets each one as the beginning of a new
> > > > > > packet, corrupting the packet stream.
> > > > > >
> > > > > > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > > > > > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised,
> > > > > > so we have a chance to examine and handle the potential remaining descs
> > > > > > of this big overflow'ed skb.
> > > > > >
> > > > > > When the last fragment (without XDP_PKT_CONTD) is processed, the flag
> > > > > > is cleared and the loop continues to process subsequent descriptors
> > > > > > with the remaining budget. This behavior follows how previous xmit path
> > > > > > treats overflow packets.
> > > > > >
> > > > > > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > > > > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > > ---
> > > > > >  include/net/xdp_sock.h |  1 +
> > > > > >  net/xdp/xsk.c          | 19 +++++++++++++++++++
> > > > > >  2 files changed, 20 insertions(+)
> > > > > >
> > > > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > > > index ebac60a3d8a1..8b51876efbed 100644
> > > > > > --- a/include/net/xdp_sock.h
> > > > > > +++ b/include/net/xdp_sock.h
> > > > > > @@ -80,6 +80,7 @@ struct xdp_sock {
> > > > > >        * call of __xsk_generic_xmit().
> > > > > >        */
> > > > > >       struct sk_buff *skb;
> > > > > > +     bool drain_cont;
> > > > > >
> > > > > >       struct list_head map_list;
> > > > > >       /* Protects map_list */
> > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > index 0a6203c42576..f4add7be8c93 100644
> > > > > > --- a/net/xdp/xsk.c
> > > > > > +++ b/net/xdp/xsk.c
> > > > > > @@ -1062,11 +1062,30 @@ static int __xsk_generic_xmit(struct sock *sk)
> > > > > >                       goto out;
> > > > > >               }
> > > > > >
> > > > > > +             if (unlikely(xs->drain_cont)) {
> > > > >
> > > > > Hi Jason,
> > > > >
> > > > > could we get away from introducing dedicated boolean to xdp_sock for
> > > > > handling this rare case?
> > > >
> > > > I've tried several ways, but they didn't work. Please see the short
> > > > conversation with Stan:
> > > > https://lore.kernel.org/all/CAL+tcoA6Cu3P-y2wYjJ_R9cSeRkW2tswY1LveqH=uxd2McuR-w@mail.gmail.com/
> > > >
> > > > That means we always rely on a flag (that is either static or like
> > > > this patch) to recognize how we treat this abnormal descs.
> > > >
> > > > >
> > > > > > +                     unsigned long flags;
> > > > > > +                     u32 idx;
> > > > > > +
> > > > > > +                     spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > > > > > +                     idx = xskq_get_prod(xs->pool->cq);
> > > > > > +                     xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > > > > > +                     xskq_prod_submit_n(xs->pool->cq, 1);
> > > > > > +                     spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> > > > >
> > > > > It feels a bit off to do this cleanup one-by-one.
> > > >
> > > > Actually no, it adheres to the basic send logic in copy mode.
> > >
> > > Not really. Here you are producer to cq which we have batched variants for
> > > usage.
> > >
> > > But after second thoughts I don't think it would be worth to put any
> > > further effort onto optimizations here. Anyways, even though generic xmit
> > > is a whole bunch of mess, maybe you could pull this out onto a helper
> > > function?
> >
> > Sure thing. My takeaway is that it will be refactored once as the
> > batch xmit feature lands in the tree.
>
> I agree that the code has become really messy since I get a headache
> when looking at it ;-). Do you want me to take a stab at simplifying
> it?

Thanks for your help. But not now probably. I think we can focus on
the batch xmit feature[1] where the main xmit path will be rewritten.
I admit it's still complex to understand :S Could you shed some light
on the series that I'm going to post?

[1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/

Thanks,
Jason

> >
> > >
> > > >
> > > > >
> > > > > I wonder if it could be covered by xsk_drop_skb() where you would walk the
> > > > > tx ring and store addrs at xsk_addrs and then use xsk_cq_submit_addr_locked() ?
> > > > >
> > > > > OTOH, in theory the amount of frags after we hit the overflow case could
> > > > > be well over MAX_SKB_FRAGS again so xsk_addrs wouldn't be able to cover
> > > > > them.
> > > >
> > > > Right, this is why we now put the logic into the xmit path just like
> > > > other kinds of descriptors. We need to check and send it under a
> > > > series of protections like the check of cq pressure and budget. I
> > > > don't think it's an exception to escape the check logic. The whole
> > > > process should be fast as there is no need to build skb and send it to
> > > > the driver as long as this case is recognized.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > >
> > > > > > +
> > > > > > +                     xs->tx->invalid_descs++;
> > > > > > +                     xskq_cons_release(xs->tx);
> > > > > > +                     if (!xp_mb_desc(&desc))
> > > > > > +                             xs->drain_cont = false;
> > > > > > +                     continue;
> > > > > > +             }
> > > > > > +
> > > > > >               skb = xsk_build_skb(xs, &desc);
> > > > > >               if (IS_ERR(skb)) {
> > > > > >                       err = PTR_ERR(skb);
> > > > > >                       if (err != -EOVERFLOW)
> > > > > >                               goto out;
> > > > > > +                     if (xp_mb_desc(&desc))
> > > > > > +                             xs->drain_cont = true;
> > > > > >                       err = 0;
> > > > > >                       continue;
> > > > > >               }
> > > > > > --
> > > > > > 2.43.7
> > > > > >
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH net v4 4/5] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit()
  2026-05-20  0:42 [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Jason Xing
                   ` (2 preceding siblings ...)
  2026-05-20  0:42 ` [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
@ 2026-05-20  0:42 ` Jason Xing
  2026-05-20  0:42 ` [PATCH net v4 5/5] selftests/xsk: drain CQ to wait for TX completion Jason Xing
  2026-05-21 12:23 ` [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Maciej Fijalkowski
  5 siblings, 0 replies; 25+ messages in thread
From: Jason Xing @ 2026-05-20  0:42 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When the TX loop in __xsk_generic_xmit() encounters an invalid
descriptor mid-packet (e.g. an out-of-bounds address), the partial
skb is dropped and the offending descriptor is released. However,
remaining continuation descriptors belonging to the same multi-buffer
packet still sit in the TX ring. Since xs->skb becomes NULL after the
drop, the next iteration treats the leftover continuation fragment as
a brand-new packet, corrupting the packet stream.

Fix this by setting the drain_cont flag when the released descriptor
has XDP_PKT_CONTD set. On the next call to __xsk_generic_xmit(), the
drain logic introduced in the previous patch handles the remaining
fragments with normal CQ backpressure.

There is one subtle case: if a subsequent continuation descriptor also
has an invalid address, xskq_cons_peek_desc() rejects it and the
while loop is never entered, so the in-loop drain path cannot clear
drain_cont. The post-loop code already handles this: it sees
xskq_has_descs() is true (the failed descriptor was read but not
released by peek), releases it, and checks its XDP_PKT_CONTD flag.
Add an else branch so that when the released descriptor is the
last fragment (no XDP_PKT_CONTD), drain_cont is cleared. This
prevents the next valid packet from being incorrectly drained.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f4add7be8c93..de953f38b9e2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1122,6 +1122,7 @@ static int __xsk_generic_xmit(struct sock *sk)
 		if (xs->skb)
 			xsk_drop_skb(xs->skb);
 		xskq_cons_release(xs->tx);
+		xs->drain_cont = xp_mb_desc(&desc);
 	}

 out:
-- 
2.43.7

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net v4 5/5] selftests/xsk: drain CQ to wait for TX completion
  2026-05-20  0:42 [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Jason Xing
                   ` (3 preceding siblings ...)
  2026-05-20  0:42 ` [PATCH net v4 4/5] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
@ 2026-05-20  0:42 ` Jason Xing
  2026-05-21 12:23 ` [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Maciej Fijalkowski
  5 siblings, 0 replies; 25+ messages in thread
From: Jason Xing @ 2026-05-20  0:42 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

After the kernel xsk drain_cont patches, dropped multi-buffer
descriptors get their buffer addresses published to the completion
queue (CQ) via the skb destructor instead of being cancelled. As a
result, the CQ entries observed by user space no longer match the
software-side accounting based on valid_frags only:
__send_pkts() bumps xsk->outstanding_tx by valid_frags, while
complete_pkts() decrements it by every CQ entry it consumes,
including those produced by drops/drains. This makes
outstanding_tx underflow and causes wait_for_tx_completion() to
exit while valid descriptors are still sitting in the TX ring,
which in turn makes receive_pkts() time out for the
ALIGNED_INV_DESC_MULTI_BUFF, UNALIGNED_INV_DESC_MULTI_BUFF and
TOO_MANY_FRAGS subtests.

Fix this with two changes to the TX completion path:
- complete_pkts(): tolerate extra CQ completions by clamping
  outstanding_tx to zero instead of failing.
- wait_for_tx_completion(): after the outstanding_tx loop finishes,
  add a drain loop that kicks TX and consumes remaining CQ entries.
  After the drain loop exits, do a short usleep and one final
  complete_pkts() call so that real hardware (e.g. ice) has enough
  time to post late CQ entries before we conclude the ring is
  fully drained.

Adjust the multi-buffer invalid-desc tests so that the last
descriptor of every invalid packet has XDP_PKT_CONTD cleared.
Without this, the kernel drain_cont logic would consume
descriptors past the packet boundary and eat into the next valid
packet, breaking pkt_nb validation. Concretely:
- XSK_DESC__INVALID_OPTION is changed from 0xffff to 0xfffe so it
  no longer asserts the XDP_PKT_CONTD bit (bit 0).
- testapp_invalid_desc_mb() clears XDP_PKT_CONTD on the trailing
  descriptor of the invalid-address and invalid-length packets.
- testapp_too_many_frags() appends one extra terminating
  descriptor so the over-sized invalid packet ends with
  XDP_PKT_CONTD cleared, preventing the drain from spilling into
  the trailing sync packet.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 7950c504ed28..1f196c8ebc73 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -31,7 +31,7 @@
 #define POLL_TMOUT			1000
 #define THREAD_TMOUT			3
 #define UMEM_HEADROOM_TEST_SIZE		128
-#define XSK_DESC__INVALID_OPTION	(0xffff)
+#define XSK_DESC__INVALID_OPTION	(0xfffe)
 #define XSK_UMEM__INVALID_FRAME_SIZE	(MAX_ETH_JUMBO_SIZE + 1)
 #define XSK_UMEM__LARGE_FRAME_SIZE	(3 * 1024)
 #define XSK_UMEM__MAX_FRAME_SIZE	(4 * 1024)
@@ -950,17 +950,11 @@ static int complete_pkts(struct xsk_socket_info *xsk, int batch_size)
 
 	rcvd = xsk_ring_cons__peek(&xsk->umem->cq, batch_size, &idx);
 	if (rcvd) {
-		if (rcvd > xsk->outstanding_tx) {
-			u64 addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx + rcvd - 1);
-
-			ksft_print_msg("[%s] Too many packets completed\n", __func__);
-			ksft_print_msg("Last completion address: %llx\n",
-				       (unsigned long long)addr);
-			return TEST_FAILURE;
-		}
-
 		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
-		xsk->outstanding_tx -= rcvd;
+		if (rcvd > xsk->outstanding_tx)
+			xsk->outstanding_tx = 0;
+		else
+			xsk->outstanding_tx -= rcvd;
 	}
 
 	return TEST_PASS;
@@ -1274,6 +1268,8 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 static int wait_for_tx_completion(struct xsk_socket_info *xsk)
 {
 	struct timeval tv_end, tv_now, tv_timeout = {THREAD_TMOUT, 0};
+	unsigned int rcvd;
+	u32 idx;
 	int ret;
 
 	ret = gettimeofday(&tv_now, NULL);
@@ -1293,6 +1289,17 @@ static int wait_for_tx_completion(struct xsk_socket_info *xsk)
 		complete_pkts(xsk, xsk->batch_size);
 	}
 
+	do {
+		if (xsk_ring_prod__needs_wakeup(&xsk->tx))
+			kick_tx(xsk);
+		rcvd = xsk_ring_cons__peek(&xsk->umem->cq, xsk->batch_size, &idx);
+		if (rcvd)
+			xsk_ring_cons__release(&xsk->umem->cq, rcvd);
+	} while (rcvd);
+
+	usleep(100);
+	complete_pkts(xsk, xsk->batch_size);
+
 	return TEST_PASS;
 }
 
@@ -2075,10 +2082,10 @@ int testapp_invalid_desc_mb(struct test_spec *test)
 		{0, 0, 0, false, 0},
 		/* Invalid address in the second frame */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{umem_size, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{umem_size, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, 0},
 		/* Invalid len in the middle */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, 0},
 		/* Invalid options in the middle */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XSK_DESC__INVALID_OPTION},
@@ -2229,7 +2236,7 @@ int testapp_too_many_frags(struct test_spec *test)
 		max_frags += 1;
 	}
 
-	pkts = calloc(2 * max_frags + 2, sizeof(struct pkt));
+	pkts = calloc(2 * max_frags + 3, sizeof(struct pkt));
 	if (!pkts)
 		return TEST_FAILURE;
 
@@ -2247,20 +2254,19 @@ int testapp_too_many_frags(struct test_spec *test)
 	}
 	pkts[max_frags].options = 0;
 
-	/* An invalid packet with the max amount of frags but signals packet
-	 * continues on the last frag
-	 */
-	for (i = max_frags + 1; i < 2 * max_frags + 1; i++) {
+	/* An invalid packet with too many frags */
+	for (i = max_frags + 1; i < 2 * max_frags + 2; i++) {
 		pkts[i].len = MIN_PKT_SIZE;
 		pkts[i].options = XDP_PKT_CONTD;
 		pkts[i].valid = false;
 	}
+	pkts[2 * max_frags + 1].options = 0;
 
 	/* Valid packet for synch */
-	pkts[2 * max_frags + 1].len = MIN_PKT_SIZE;
-	pkts[2 * max_frags + 1].valid = true;
+	pkts[2 * max_frags + 2].len = MIN_PKT_SIZE;
+	pkts[2 * max_frags + 2].valid = true;
 
-	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 2)) {
+	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 3)) {
 		free(pkts);
 		return TEST_FAILURE;
 	}
-- 
2.43.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-20  0:42 [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Jason Xing
                   ` (4 preceding siblings ...)
  2026-05-20  0:42 ` [PATCH net v4 5/5] selftests/xsk: drain CQ to wait for TX completion Jason Xing
@ 2026-05-21 12:23 ` Maciej Fijalkowski
  2026-05-21 12:41   ` Jason Xing
  5 siblings, 1 reply; 25+ messages in thread
From: Maciej Fijalkowski @ 2026-05-21 12:23 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> The series is the product of previous review from sashiko[1].
> 
> 1) META
> patch 1: address TOCTOU around metadata.
> 
> 2) PUBLISH of CQ
> patch 2: make sure xsk_addr->addrs[] can be published to cq when
>          overflow occurs.
> patch 3: keep cleaning up the continuation descs (more than 17) and
>          publish its address when overflow occurs.
> patch 4: like patch 3, but only handles the invalid descs cases.
> 
> [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> 
> ---
> V4
> Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> 1. correct the description of xmit path in patch 3 (sashiko)
> 2. move set logic into xmit path in patch 3 (Stan)
> 
> V3
> Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> 1. avoid breaking previous usage of sendto, and siliently handle
> overflow case (Stan, sashiko)
> 2. add one particular exception process in patch 4 (sashiko)
> 3. adjust the selftest to make sure it passes in either virutal or
> physical machines, which includes add usleep to support physical machine.
> 
> V2
> Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> 1. adjust selftests (Jakub)
> 2. add READ_ONCE in patch 1 (Stan)

FWIW I still get test failures (yes with patch 5 applied). PTAL.

> 
> 
> Jason Xing (5):
>   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
>   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
>   xsk: drain continuation descs after overflow in xsk_build_skb()
>   xsk: drain continuation descs on invalid descriptor in
>     __xsk_generic_xmit()
>   selftests/xsk: drain CQ to wait for TX completion
> 
>  include/net/xdp_sock.h                        |  1 +
>  net/xdp/xsk.c                                 | 44 +++++++++++++----
>  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
>  3 files changed, 63 insertions(+), 30 deletions(-)
> 
> -- 
> 2.43.7
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-21 12:23 ` [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Maciej Fijalkowski
@ 2026-05-21 12:41   ` Jason Xing
  2026-05-21 12:59     ` Maciej Fijalkowski
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-21 12:41 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > The series is the product of previous review from sashiko[1].
> >
> > 1) META
> > patch 1: address TOCTOU around metadata.
> >
> > 2) PUBLISH of CQ
> > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> >          overflow occurs.
> > patch 3: keep cleaning up the continuation descs (more than 17) and
> >          publish its address when overflow occurs.
> > patch 4: like patch 3, but only handles the invalid descs cases.
> >
> > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> >
> > ---
> > V4
> > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > 1. correct the description of xmit path in patch 3 (sashiko)
> > 2. move set logic into xmit path in patch 3 (Stan)
> >
> > V3
> > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > 1. avoid breaking previous usage of sendto, and siliently handle
> > overflow case (Stan, sashiko)
> > 2. add one particular exception process in patch 4 (sashiko)
> > 3. adjust the selftest to make sure it passes in either virutal or
> > physical machines, which includes add usleep to support physical machine.
> >
> > V2
> > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > 1. adjust selftests (Jakub)
> > 2. add READ_ONCE in patch 1 (Stan)
>
> FWIW I still get test failures (yes with patch 5 applied). PTAL.

Thanks for the test. But I've tried with ixgbe driver...

I noticed there are some flaky tests which have nothing to do with the
series. Can you confirm that it's not caused because of the series?

Thanks,
Jason


>
> >
> >
> > Jason Xing (5):
> >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> >   xsk: drain continuation descs after overflow in xsk_build_skb()
> >   xsk: drain continuation descs on invalid descriptor in
> >     __xsk_generic_xmit()
> >   selftests/xsk: drain CQ to wait for TX completion
> >
> >  include/net/xdp_sock.h                        |  1 +
> >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> >  3 files changed, 63 insertions(+), 30 deletions(-)
> >
> > --
> > 2.43.7
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-21 12:41   ` Jason Xing
@ 2026-05-21 12:59     ` Maciej Fijalkowski
  2026-05-21 13:07       ` Jason Xing
  0 siblings, 1 reply; 25+ messages in thread
From: Maciej Fijalkowski @ 2026-05-21 12:59 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > The series is the product of previous review from sashiko[1].
> > >
> > > 1) META
> > > patch 1: address TOCTOU around metadata.
> > >
> > > 2) PUBLISH of CQ
> > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > >          overflow occurs.
> > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > >          publish its address when overflow occurs.
> > > patch 4: like patch 3, but only handles the invalid descs cases.
> > >
> > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > >
> > > ---
> > > V4
> > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > 2. move set logic into xmit path in patch 3 (Stan)
> > >
> > > V3
> > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > 1. avoid breaking previous usage of sendto, and siliently handle
> > > overflow case (Stan, sashiko)
> > > 2. add one particular exception process in patch 4 (sashiko)
> > > 3. adjust the selftest to make sure it passes in either virutal or
> > > physical machines, which includes add usleep to support physical machine.
> > >
> > > V2
> > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > 1. adjust selftests (Jakub)
> > > 2. add READ_ONCE in patch 1 (Stan)
> >
> > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> 
> Thanks for the test. But I've tried with ixgbe driver...
> 
> I noticed there are some flaky tests which have nothing to do with the
> series. Can you confirm that it's not caused because of the series?

That explains the different results as i am using i40e/ice which have
multi-buffer support whereas ixgbe does not even support mbuf at XDP.
Broken tests are from mbuf cases.

> 
> Thanks,
> Jason
> 
> 
> >
> > >
> > >
> > > Jason Xing (5):
> > >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > >   xsk: drain continuation descs after overflow in xsk_build_skb()
> > >   xsk: drain continuation descs on invalid descriptor in
> > >     __xsk_generic_xmit()
> > >   selftests/xsk: drain CQ to wait for TX completion
> > >
> > >  include/net/xdp_sock.h                        |  1 +
> > >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> > >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> > >  3 files changed, 63 insertions(+), 30 deletions(-)
> > >
> > > --
> > > 2.43.7
> > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-21 12:59     ` Maciej Fijalkowski
@ 2026-05-21 13:07       ` Jason Xing
  2026-05-21 14:24         ` Maciej Fijalkowski
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-21 13:07 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> > On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > The series is the product of previous review from sashiko[1].
> > > >
> > > > 1) META
> > > > patch 1: address TOCTOU around metadata.
> > > >
> > > > 2) PUBLISH of CQ
> > > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > > >          overflow occurs.
> > > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > > >          publish its address when overflow occurs.
> > > > patch 4: like patch 3, but only handles the invalid descs cases.
> > > >
> > > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > > >
> > > > ---
> > > > V4
> > > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > > 2. move set logic into xmit path in patch 3 (Stan)
> > > >
> > > > V3
> > > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > > 1. avoid breaking previous usage of sendto, and siliently handle
> > > > overflow case (Stan, sashiko)
> > > > 2. add one particular exception process in patch 4 (sashiko)
> > > > 3. adjust the selftest to make sure it passes in either virutal or
> > > > physical machines, which includes add usleep to support physical machine.
> > > >
> > > > V2
> > > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > > 1. adjust selftests (Jakub)
> > > > 2. add READ_ONCE in patch 1 (Stan)
> > >
> > > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> >
> > Thanks for the test. But I've tried with ixgbe driver...
> >
> > I noticed there are some flaky tests which have nothing to do with the
> > series. Can you confirm that it's not caused because of the series?
>
> That explains the different results as i am using i40e/ice which have
> multi-buffer support whereas ixgbe does not even support mbuf at XDP.
> Broken tests are from mbuf cases.

That's weird. I never expected the failed tests to be about multi-buffer.

Are they the same as the output you attached last time? Or something
new? Could you please share it so that I can investigate the root
cause?

Thanks,
Jason

>
> >
> > Thanks,
> > Jason
> >
> >
> > >
> > > >
> > > >
> > > > Jason Xing (5):
> > > >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > > >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > > >   xsk: drain continuation descs after overflow in xsk_build_skb()
> > > >   xsk: drain continuation descs on invalid descriptor in
> > > >     __xsk_generic_xmit()
> > > >   selftests/xsk: drain CQ to wait for TX completion
> > > >
> > > >  include/net/xdp_sock.h                        |  1 +
> > > >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> > > >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> > > >  3 files changed, 63 insertions(+), 30 deletions(-)
> > > >
> > > > --
> > > > 2.43.7
> > > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-21 13:07       ` Jason Xing
@ 2026-05-21 14:24         ` Maciej Fijalkowski
  2026-05-22  8:55           ` Jason Xing
  0 siblings, 1 reply; 25+ messages in thread
From: Maciej Fijalkowski @ 2026-05-21 14:24 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Thu, May 21, 2026 at 09:07:30PM +0800, Jason Xing wrote:
> On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> > > On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > The series is the product of previous review from sashiko[1].
> > > > >
> > > > > 1) META
> > > > > patch 1: address TOCTOU around metadata.
> > > > >
> > > > > 2) PUBLISH of CQ
> > > > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > > > >          overflow occurs.
> > > > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > > > >          publish its address when overflow occurs.
> > > > > patch 4: like patch 3, but only handles the invalid descs cases.
> > > > >
> > > > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > > > >
> > > > > ---
> > > > > V4
> > > > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > > > 2. move set logic into xmit path in patch 3 (Stan)
> > > > >
> > > > > V3
> > > > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > > > 1. avoid breaking previous usage of sendto, and siliently handle
> > > > > overflow case (Stan, sashiko)
> > > > > 2. add one particular exception process in patch 4 (sashiko)
> > > > > 3. adjust the selftest to make sure it passes in either virutal or
> > > > > physical machines, which includes add usleep to support physical machine.
> > > > >
> > > > > V2
> > > > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > > > 1. adjust selftests (Jakub)
> > > > > 2. add READ_ONCE in patch 1 (Stan)
> > > >
> > > > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> > >
> > > Thanks for the test. But I've tried with ixgbe driver...
> > >
> > > I noticed there are some flaky tests which have nothing to do with the
> > > series. Can you confirm that it's not caused because of the series?
> >
> > That explains the different results as i am using i40e/ice which have
> > multi-buffer support whereas ixgbe does not even support mbuf at XDP.
> > Broken tests are from mbuf cases.
> 
> That's weird. I never expected the failed tests to be about multi-buffer.
> 
> Are they the same as the output you attached last time? Or something
> new? Could you please share it so that I can investigate the root
> cause?

# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 21 FAIL: SKB ALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 33 FAIL: SKB UNALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 57 FAIL: DRV ALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 69 FAIL: DRV UNALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 93 FAIL: ZC ALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
---------------------------------------
not ok 94 FAIL: ZC TOO_MANY_FRAGS
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 105 FAIL: ZC UNALIGNED_INV_DESC_MULTI_BUFF

# 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
1..108
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 21 FAIL: SKB BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 33 FAIL: SKB BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 57 FAIL: DRV BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 69 FAIL: DRV BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 93 FAIL: ZC BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
# [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
---------------------------------------
# [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
# DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01# 
DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00# 
DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....# 
.... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 
---------------------------------------
not ok 105 FAIL: ZC BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF

# 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]

Summary:
XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]

> 
> Thanks,
> Jason
> 
> >
> > >
> > > Thanks,
> > > Jason
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > Jason Xing (5):
> > > > >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > > > >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > > > >   xsk: drain continuation descs after overflow in xsk_build_skb()
> > > > >   xsk: drain continuation descs on invalid descriptor in
> > > > >     __xsk_generic_xmit()
> > > > >   selftests/xsk: drain CQ to wait for TX completion
> > > > >
> > > > >  include/net/xdp_sock.h                        |  1 +
> > > > >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> > > > >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> > > > >  3 files changed, 63 insertions(+), 30 deletions(-)
> > > > >
> > > > > --
> > > > > 2.43.7
> > > > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-21 14:24         ` Maciej Fijalkowski
@ 2026-05-22  8:55           ` Jason Xing
  2026-05-22 13:48             ` Jason Xing
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-22  8:55 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

[-- Attachment #1: Type: text/plain, Size: 11890 bytes --]

On Thu, May 21, 2026 at 10:24 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Thu, May 21, 2026 at 09:07:30PM +0800, Jason Xing wrote:
> > On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> > > > On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> > > > <maciej.fijalkowski@intel.com> wrote:
> > > > >
> > > > > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > The series is the product of previous review from sashiko[1].
> > > > > >
> > > > > > 1) META
> > > > > > patch 1: address TOCTOU around metadata.
> > > > > >
> > > > > > 2) PUBLISH of CQ
> > > > > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > > > > >          overflow occurs.
> > > > > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > > > > >          publish its address when overflow occurs.
> > > > > > patch 4: like patch 3, but only handles the invalid descs cases.
> > > > > >
> > > > > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > > > > >
> > > > > > ---
> > > > > > V4
> > > > > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > > > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > > > > 2. move set logic into xmit path in patch 3 (Stan)
> > > > > >
> > > > > > V3
> > > > > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > > > > 1. avoid breaking previous usage of sendto, and siliently handle
> > > > > > overflow case (Stan, sashiko)
> > > > > > 2. add one particular exception process in patch 4 (sashiko)
> > > > > > 3. adjust the selftest to make sure it passes in either virutal or
> > > > > > physical machines, which includes add usleep to support physical machine.
> > > > > >
> > > > > > V2
> > > > > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > > > > 1. adjust selftests (Jakub)
> > > > > > 2. add READ_ONCE in patch 1 (Stan)
> > > > >
> > > > > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> > > >
> > > > Thanks for the test. But I've tried with ixgbe driver...
> > > >
> > > > I noticed there are some flaky tests which have nothing to do with the
> > > > series. Can you confirm that it's not caused because of the series?
> > >
> > > That explains the different results as i am using i40e/ice which have
> > > multi-buffer support whereas ixgbe does not even support mbuf at XDP.
> > > Broken tests are from mbuf cases.
> >
> > That's weird. I never expected the failed tests to be about multi-buffer.
> >
> > Are they the same as the output you attached last time? Or something
> > new? Could you please share it so that I can investigate the root
> > cause?
>
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 21 FAIL: SKB ALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 33 FAIL: SKB UNALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 57 FAIL: DRV ALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 69 FAIL: DRV UNALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 93 FAIL: ZC ALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> ---------------------------------------
> not ok 94 FAIL: ZC TOO_MANY_FRAGS
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 105 FAIL: ZC UNALIGNED_INV_DESC_MULTI_BUFF
>
> # 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
> # Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
> XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
> 1..108
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 21 FAIL: SKB BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 33 FAIL: SKB BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 57 FAIL: DRV BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 69 FAIL: DRV BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 93 FAIL: ZC BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> # [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> ---------------------------------------
> # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> ---------------------------------------
> not ok 105 FAIL: ZC BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
>
> # 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
> # Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
> XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]
>
> Summary:
> XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
> XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]

Sorry, Maciej. I managed to get one server with i40e nic but still
couldn't reproduce it. Can you try the attachment (that is the
replacement for v4-0005) instead? I removed those nasty CONT test
cases...

Really I don't think I have much time to spend on these tests which
makes me feel extremely annoyed... It's not easy to analyze the code
without a reproducer. The good news is that now I highly suspect that
this kind of CONT test cases pollute the whole cq which affects other
tests. Before I give up on the 0003/0004 patches, I'd like to hear
some advice from you. Thank you.

My original intention was to push batch xmit forward but at that time
sashiko pointed out some unrelated bugs accidentally.

Thanks,
Jason

>
> >
> > Thanks,
> > Jason
> >
> > >
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > Jason Xing (5):
> > > > > >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > > > > >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > > > > >   xsk: drain continuation descs after overflow in xsk_build_skb()
> > > > > >   xsk: drain continuation descs on invalid descriptor in
> > > > > >     __xsk_generic_xmit()
> > > > > >   selftests/xsk: drain CQ to wait for TX completion
> > > > > >
> > > > > >  include/net/xdp_sock.h                        |  1 +
> > > > > >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> > > > > >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> > > > > >  3 files changed, 63 insertions(+), 30 deletions(-)
> > > > > >
> > > > > > --
> > > > > > 2.43.7
> > > > > >

[-- Attachment #2: test.patch.1 --]
[-- Type: application/octet-stream, Size: 2980 bytes --]

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 7950c504ed28..4eb8356523aa 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -2070,31 +2070,25 @@ int testapp_invalid_desc_mb(struct test_spec *test)
 		/* Valid packet for synch to start with */
 		{0, MIN_PKT_SIZE, 0, true, 0},
 		/* Zero frame len is not legal */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
 		{0, 0, 0, false, 0},
-		/* Invalid address in the second frame */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{umem_size, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		/* Invalid len in the middle */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		/* Invalid options in the middle */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XSK_DESC__INVALID_OPTION},
+		/* Invalid address */
+		{umem_size, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, 0},
+		/* Invalid len */
+		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, 0},
+		/* Invalid options */
+		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false,
+		 XSK_DESC__INVALID_OPTION & ~XDP_PKT_CONTD},
 		/* Transmit 2 frags, receive 3 */
 		{0, XSK_UMEM__MAX_FRAME_SIZE, 0, true, XDP_PKT_CONTD},
 		{0, XSK_UMEM__MAX_FRAME_SIZE, 0, true, 0},
 		/* Middle frame crosses chunk boundary with small length */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
 		{-MIN_PKT_SIZE / 2, MIN_PKT_SIZE, 0, false, 0},
 		/* Valid packet for synch so that something is received */
 		{0, MIN_PKT_SIZE, 0, true, 0}};
 
 	if (umem->unaligned_mode) {
 		/* Crossing a chunk boundary allowed */
-		pkts[12].valid = true;
-		pkts[13].valid = true;
+		pkts[7].valid = true;
 	}
 
 	test->mtu = MAX_ETH_JUMBO_SIZE;
@@ -2229,7 +2223,7 @@ int testapp_too_many_frags(struct test_spec *test)
 		max_frags += 1;
 	}
 
-	pkts = calloc(2 * max_frags + 2, sizeof(struct pkt));
+	pkts = calloc(max_frags + 2, sizeof(struct pkt));
 	if (!pkts)
 		return TEST_FAILURE;
 
@@ -2247,20 +2241,11 @@ int testapp_too_many_frags(struct test_spec *test)
 	}
 	pkts[max_frags].options = 0;
 
-	/* An invalid packet with the max amount of frags but signals packet
-	 * continues on the last frag
-	 */
-	for (i = max_frags + 1; i < 2 * max_frags + 1; i++) {
-		pkts[i].len = MIN_PKT_SIZE;
-		pkts[i].options = XDP_PKT_CONTD;
-		pkts[i].valid = false;
-	}
-
 	/* Valid packet for synch */
-	pkts[2 * max_frags + 1].len = MIN_PKT_SIZE;
-	pkts[2 * max_frags + 1].valid = true;
+	pkts[max_frags + 1].len = MIN_PKT_SIZE;
+	pkts[max_frags + 1].valid = true;
 
-	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 2)) {
+	if (pkt_stream_generate_custom(test, pkts, max_frags + 2)) {
 		free(pkts);
 		return TEST_FAILURE;
 	}

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-22  8:55           ` Jason Xing
@ 2026-05-22 13:48             ` Jason Xing
  2026-05-22 18:33               ` Maciej Fijalkowski
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-22 13:48 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

[-- Attachment #1: Type: text/plain, Size: 12595 bytes --]

On Fri, May 22, 2026 at 4:55 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, May 21, 2026 at 10:24 PM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Thu, May 21, 2026 at 09:07:30PM +0800, Jason Xing wrote:
> > > On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> > > > > On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > >
> > > > > > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > > >
> > > > > > > The series is the product of previous review from sashiko[1].
> > > > > > >
> > > > > > > 1) META
> > > > > > > patch 1: address TOCTOU around metadata.
> > > > > > >
> > > > > > > 2) PUBLISH of CQ
> > > > > > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > > > > > >          overflow occurs.
> > > > > > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > > > > > >          publish its address when overflow occurs.
> > > > > > > patch 4: like patch 3, but only handles the invalid descs cases.
> > > > > > >
> > > > > > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > > > > > >
> > > > > > > ---
> > > > > > > V4
> > > > > > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > > > > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > > > > > 2. move set logic into xmit path in patch 3 (Stan)
> > > > > > >
> > > > > > > V3
> > > > > > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > > > > > 1. avoid breaking previous usage of sendto, and siliently handle
> > > > > > > overflow case (Stan, sashiko)
> > > > > > > 2. add one particular exception process in patch 4 (sashiko)
> > > > > > > 3. adjust the selftest to make sure it passes in either virutal or
> > > > > > > physical machines, which includes add usleep to support physical machine.
> > > > > > >
> > > > > > > V2
> > > > > > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > > > > > 1. adjust selftests (Jakub)
> > > > > > > 2. add READ_ONCE in patch 1 (Stan)
> > > > > >
> > > > > > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> > > > >
> > > > > Thanks for the test. But I've tried with ixgbe driver...
> > > > >
> > > > > I noticed there are some flaky tests which have nothing to do with the
> > > > > series. Can you confirm that it's not caused because of the series?
> > > >
> > > > That explains the different results as i am using i40e/ice which have
> > > > multi-buffer support whereas ixgbe does not even support mbuf at XDP.
> > > > Broken tests are from mbuf cases.
> > >
> > > That's weird. I never expected the failed tests to be about multi-buffer.
> > >
> > > Are they the same as the output you attached last time? Or something
> > > new? Could you please share it so that I can investigate the root
> > > cause?
> >
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 21 FAIL: SKB ALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 33 FAIL: SKB UNALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 57 FAIL: DRV ALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 69 FAIL: DRV UNALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 93 FAIL: ZC ALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > ---------------------------------------
> > not ok 94 FAIL: ZC TOO_MANY_FRAGS
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 105 FAIL: ZC UNALIGNED_INV_DESC_MULTI_BUFF
> >
> > # 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
> > # Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
> > XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
> > 1..108
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 21 FAIL: SKB BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 33 FAIL: SKB BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 57 FAIL: DRV BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 69 FAIL: DRV BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 93 FAIL: ZC BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> > # [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > ---------------------------------------
> > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > ---------------------------------------
> > not ok 105 FAIL: ZC BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> >
> > # 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
> > # Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
> > XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]
> >
> > Summary:
> > XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
> > XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]
>
> Sorry, Maciej. I managed to get one server with i40e nic but still
> couldn't reproduce it. Can you try the attachment (that is the
> replacement for v4-0005) instead? I removed those nasty CONT test
> cases...

Ah, I think I eventually figured out a solution. Maciej, could you
please test the 2nd patch instead?

This patch reworks the CONTD test cases. Cross finger.

Thanks,
Jason

>
> Really I don't think I have much time to spend on these tests which
> makes me feel extremely annoyed... It's not easy to analyze the code
> without a reproducer. The good news is that now I highly suspect that
> this kind of CONT test cases pollute the whole cq which affects other
> tests. Before I give up on the 0003/0004 patches, I'd like to hear
> some advice from you. Thank you.
>
> My original intention was to push batch xmit forward but at that time
> sashiko pointed out some unrelated bugs accidentally.
>
> Thanks,
> Jason
>
> >
> > >
> > > Thanks,
> > > Jason
> > >
> > > >
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Jason Xing (5):
> > > > > > >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > > > > > >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > > > > > >   xsk: drain continuation descs after overflow in xsk_build_skb()
> > > > > > >   xsk: drain continuation descs on invalid descriptor in
> > > > > > >     __xsk_generic_xmit()
> > > > > > >   selftests/xsk: drain CQ to wait for TX completion
> > > > > > >
> > > > > > >  include/net/xdp_sock.h                        |  1 +
> > > > > > >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> > > > > > >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> > > > > > >  3 files changed, 63 insertions(+), 30 deletions(-)
> > > > > > >
> > > > > > > --
> > > > > > > 2.43.7
> > > > > > >

[-- Attachment #2: 0001-selftests-xsk-rework-CONTD-testcases.patch --]
[-- Type: application/octet-stream, Size: 9451 bytes --]

From 6db8d9c1fae4b798050c9502495d957627a03c46 Mon Sep 17 00:00:00 2001
From: Jason Xing <kernelxing@tencent.com>
Date: Fri, 22 May 2026 22:00:00 +0300
Subject: [PATCH] selftests/xsk: rework CONTD testcases

The existing *INV_DESC_MULTI_BUFF and TOO_MANY_FRAGS tests embed
mid-chain invalid descriptors, which collide with both the new CQ
accounting. The previous arch was not designed for this behavior:
1) it's unneeded to test the series in zc mode.
2) outstanding_tx doesn't work without newly added nb_extra_cq.
In a word, it's unlikely to support the CONTD test in both non-zc
and zc in the unified functions.

So this patch does two things: strip the previous CONTD tests,
and then group them together into two testcases.

Add two SKB/DRV-only tests to cover the series in isolation:
DRAIN_CONT and EOVERFLOW_DRAIN_CONT. Both declare the kernel-internal
extra CQ entries via nb_extra_cq so __send_pkts() keeps strict
accounting, and both are skipped under ZC since drain_cont only exists
on the SKB/DRV path.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 135 ++++++++++++++----
 .../selftests/bpf/prog_tests/test_xsk.h       |   5 +
 2 files changed, 116 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 7950c504ed28..1c17123f993c 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -1247,7 +1247,8 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 	pthread_mutex_unlock(&pacing_mutex);
 
 	xsk_ring_prod__submit(&xsk->tx, i);
-	xsk->outstanding_tx += valid_frags;
+	xsk->outstanding_tx += valid_frags + xsk->nb_extra_cq;
+	xsk->nb_extra_cq = 0;
 
 	if (use_poll) {
 		ret = poll(&fds, 1, POLL_TMOUT);
@@ -2070,31 +2071,25 @@ int testapp_invalid_desc_mb(struct test_spec *test)
 		/* Valid packet for synch to start with */
 		{0, MIN_PKT_SIZE, 0, true, 0},
 		/* Zero frame len is not legal */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
 		{0, 0, 0, false, 0},
-		/* Invalid address in the second frame */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{umem_size, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		/* Invalid len in the middle */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		/* Invalid options in the middle */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XSK_DESC__INVALID_OPTION},
+		/* Invalid address */
+		{umem_size, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, 0},
+		/* Invalid len */
+		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, 0},
+		/* Invalid options */
+		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false,
+		 XSK_DESC__INVALID_OPTION & ~XDP_PKT_CONTD},
 		/* Transmit 2 frags, receive 3 */
 		{0, XSK_UMEM__MAX_FRAME_SIZE, 0, true, XDP_PKT_CONTD},
 		{0, XSK_UMEM__MAX_FRAME_SIZE, 0, true, 0},
-		/* Middle frame crosses chunk boundary with small length */
-		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		/* Single frame crosses chunk boundary with small length */
 		{-MIN_PKT_SIZE / 2, MIN_PKT_SIZE, 0, false, 0},
 		/* Valid packet for synch so that something is received */
 		{0, MIN_PKT_SIZE, 0, true, 0}};
 
 	if (umem->unaligned_mode) {
 		/* Crossing a chunk boundary allowed */
-		pkts[12].valid = true;
-		pkts[13].valid = true;
+		pkts[7].valid = true;
 	}
 
 	test->mtu = MAX_ETH_JUMBO_SIZE;
@@ -2229,7 +2224,7 @@ int testapp_too_many_frags(struct test_spec *test)
 		max_frags += 1;
 	}
 
-	pkts = calloc(2 * max_frags + 2, sizeof(struct pkt));
+	pkts = calloc(max_frags + 2, sizeof(struct pkt));
 	if (!pkts)
 		return TEST_FAILURE;
 
@@ -2247,24 +2242,116 @@ int testapp_too_many_frags(struct test_spec *test)
 	}
 	pkts[max_frags].options = 0;
 
-	/* An invalid packet with the max amount of frags but signals packet
-	 * continues on the last frag
+	/* Valid packet for synch */
+	pkts[max_frags + 1].len = MIN_PKT_SIZE;
+	pkts[max_frags + 1].valid = true;
+
+	if (pkt_stream_generate_custom(test, pkts, max_frags + 2)) {
+		free(pkts);
+		return TEST_FAILURE;
+	}
+
+	ret = testapp_validate_traffic(test);
+	free(pkts);
+	return ret;
+}
+
+int testapp_drain_cont(struct test_spec *test)
+{
+	struct pkt pkts[] = {
+		/* Valid packet for sync */
+		{0, MIN_PKT_SIZE, 0, true, 0},
+		/* Kernel-valid first frag: builds partial SKB */
+		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		/* Kernel-invalid (len=0) with CONTD: arms drain_cont */
+		{0, 0, 0, false, XDP_PKT_CONTD},
+		/* Drained by drain_cont, CONTD keeps draining */
+		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		/* Drained by drain_cont, no CONTD ends draining */
+		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, 0},
+		/* Valid packet for sync */
+		{0, MIN_PKT_SIZE, 0, true, 0},
+	};
+
+	if (test->mode == TEST_MODE_ZC) {
+		ksft_print_msg("Can not run DRAIN_CONT test for ZC mode\n");
+		return TEST_SKIP;
+	}
+
+	test->mtu = MAX_ETH_JUMBO_SIZE;
+	if (pkt_stream_generate_custom(test, pkts, ARRAY_SIZE(pkts)))
+		return TEST_FAILURE;
+
+	/* 1 CQ from xsk_drop_skb (partial SKB) + 2 CQ from drain_cont */
+	test->ifobj_tx->xsk->nb_extra_cq = 3;
+
+	return testapp_validate_traffic(test);
+}
+
+int testapp_eoverflow_drain_cont(struct test_spec *test)
+{
+	struct ifobject *ifobj_tx = test->ifobj_tx;
+	u32 max_skb_frags;
+	struct pkt *pkts;
+	u32 nb_pkts, i;
+	int ret;
+
+	if (test->mode == TEST_MODE_ZC) {
+		ksft_print_msg("Can not run EOVERFLOW_DRAIN_CONT test for ZC mode\n");
+		return TEST_SKIP;
+	}
+
+	max_skb_frags = ifobj_tx->max_skb_frags;
+
+	/*
+	 * Layout (let N = max_skb_frags, 17 as default):
+	 *
+	 *   pkts[0]            : valid sync packet.
+	 *   pkts[1..N+1]       : N+1 CONTD frags forming a multi-buffer
+	 *                        chain that overruns MAX_SKB_FRAGS. The
+	 *                        last frag triggers -EOVERFLOW in
+	 *                        xsk_build_skb(); xsk_drop_skb() then
+	 *                        publishes N+1 accumulated UMEM
+	 *                        addresses to the CQ via the SKB
+	 *                        destructor and arms drain_cont.
+	 *   pkts[N+2..N+3]     : 2 CONTD frags consumed by drain_cont's
+	 *                        write-to-CQ branch on the next sendto().
+	 *   pkts[N+4]          : !CONTD frag that closes drain_cont
+	 *                        (xp_mb_desc(&desc) becomes false).
+	 *   pkts[N+5]          : valid sync packet, sent normally to
+	 *                        confirm that the socket recovered and
+	 *                        drain_cont is no longer active.
 	 */
-	for (i = max_frags + 1; i < 2 * max_frags + 1; i++) {
+	nb_pkts = max_skb_frags + 6;
+	pkts = calloc(nb_pkts, sizeof(struct pkt));
+	if (!pkts)
+		return TEST_FAILURE;
+
+	test->mtu = MAX_ETH_JUMBO_SIZE;
+
+	pkts[0].len = MIN_PKT_SIZE;
+	pkts[0].valid = true;
+
+	for (i = 1; i <= max_skb_frags + 4; i++) {
 		pkts[i].len = MIN_PKT_SIZE;
 		pkts[i].options = XDP_PKT_CONTD;
 		pkts[i].valid = false;
 	}
+	pkts[max_skb_frags + 4].options = 0;
 
-	/* Valid packet for synch */
-	pkts[2 * max_frags + 1].len = MIN_PKT_SIZE;
-	pkts[2 * max_frags + 1].valid = true;
+	pkts[max_skb_frags + 5].len = MIN_PKT_SIZE;
+	pkts[max_skb_frags + 5].valid = true;
 
-	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 2)) {
+	if (pkt_stream_generate_custom(test, pkts, nb_pkts)) {
 		free(pkts);
 		return TEST_FAILURE;
 	}
 
+	/* (max_skb_frags + 1) CQ from xsk_drop_skb on EOVERFLOW and
+	 * 3 from drain_cont.
+	 */
+	ifobj_tx->xsk->nb_extra_cq = max_skb_frags + 4;
+
 	ret = testapp_validate_traffic(test);
 	free(pkts);
 	return ret;
diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.h b/tools/testing/selftests/bpf/prog_tests/test_xsk.h
index 1ab8aee4ce56..2e6488e0452b 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.h
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.h
@@ -87,6 +87,7 @@ struct xsk_socket_info {
 	struct xsk_socket *xsk;
 	struct pkt_stream *pkt_stream;
 	u32 outstanding_tx;
+	u32 nb_extra_cq;
 	u32 rxqsize;
 	u32 batch_size;
 	u8 dst_mac[ETH_ALEN];
@@ -237,6 +238,8 @@ int testapp_aligned_inv_desc(struct test_spec *test);
 int testapp_aligned_inv_desc_2k_frame(struct test_spec *test);
 int testapp_aligned_inv_desc_mb(struct test_spec *test);
 int testapp_bidirectional(struct test_spec *test);
+int testapp_drain_cont(struct test_spec *test);
+int testapp_eoverflow_drain_cont(struct test_spec *test);
 int testapp_headroom(struct test_spec *test);
 int testapp_hw_sw_max_ring_size(struct test_spec *test);
 int testapp_hw_sw_min_ring_size(struct test_spec *test);
@@ -292,6 +295,8 @@ static const struct test_spec tests[] = {
 	{.name = "XDP_METADATA_COPY_MULTI_BUFF", .test_func = testapp_xdp_metadata_mb},
 	{.name = "ALIGNED_INV_DESC_MULTI_BUFF", .test_func = testapp_aligned_inv_desc_mb},
 	{.name = "TOO_MANY_FRAGS", .test_func = testapp_too_many_frags},
+	{.name = "DRAIN_CONT", .test_func = testapp_drain_cont},
+	{.name = "EOVERFLOW_DRAIN_CONT", .test_func = testapp_eoverflow_drain_cont},
 	{.name = "XDP_ADJUST_TAIL_SHRINK", .test_func = testapp_adjust_tail_shrink},
 	{.name = "TX_QUEUE_CONSUMER", .test_func = testapp_tx_queue_consumer},
 	};
-- 
2.43.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-22 13:48             ` Jason Xing
@ 2026-05-22 18:33               ` Maciej Fijalkowski
  2026-05-22 23:49                 ` Jason Xing
  0 siblings, 1 reply; 25+ messages in thread
From: Maciej Fijalkowski @ 2026-05-22 18:33 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Fri, May 22, 2026 at 09:48:39PM +0800, Jason Xing wrote:
> On Fri, May 22, 2026 at 4:55 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, May 21, 2026 at 10:24 PM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Thu, May 21, 2026 at 09:07:30PM +0800, Jason Xing wrote:
> > > > On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
> > > > <maciej.fijalkowski@intel.com> wrote:
> > > > >
> > > > > On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> > > > > > On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > >
> > > > > > > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > > > >
> > > > > > > > The series is the product of previous review from sashiko[1].
> > > > > > > >
> > > > > > > > 1) META
> > > > > > > > patch 1: address TOCTOU around metadata.
> > > > > > > >
> > > > > > > > 2) PUBLISH of CQ
> > > > > > > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > > > > > > >          overflow occurs.
> > > > > > > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > > > > > > >          publish its address when overflow occurs.
> > > > > > > > patch 4: like patch 3, but only handles the invalid descs cases.
> > > > > > > >
> > > > > > > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > > > > > > >
> > > > > > > > ---
> > > > > > > > V4
> > > > > > > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > > > > > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > > > > > > 2. move set logic into xmit path in patch 3 (Stan)
> > > > > > > >
> > > > > > > > V3
> > > > > > > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > > > > > > 1. avoid breaking previous usage of sendto, and siliently handle
> > > > > > > > overflow case (Stan, sashiko)
> > > > > > > > 2. add one particular exception process in patch 4 (sashiko)
> > > > > > > > 3. adjust the selftest to make sure it passes in either virutal or
> > > > > > > > physical machines, which includes add usleep to support physical machine.
> > > > > > > >
> > > > > > > > V2
> > > > > > > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > > > > > > 1. adjust selftests (Jakub)
> > > > > > > > 2. add READ_ONCE in patch 1 (Stan)
> > > > > > >
> > > > > > > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> > > > > >
> > > > > > Thanks for the test. But I've tried with ixgbe driver...
> > > > > >
> > > > > > I noticed there are some flaky tests which have nothing to do with the
> > > > > > series. Can you confirm that it's not caused because of the series?
> > > > >
> > > > > That explains the different results as i am using i40e/ice which have
> > > > > multi-buffer support whereas ixgbe does not even support mbuf at XDP.
> > > > > Broken tests are from mbuf cases.
> > > >
> > > > That's weird. I never expected the failed tests to be about multi-buffer.
> > > >
> > > > Are they the same as the output you attached last time? Or something
> > > > new? Could you please share it so that I can investigate the root
> > > > cause?
> > >
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 21 FAIL: SKB ALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 33 FAIL: SKB UNALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 57 FAIL: DRV ALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 69 FAIL: DRV UNALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 93 FAIL: ZC ALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > ---------------------------------------
> > > not ok 94 FAIL: ZC TOO_MANY_FRAGS
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 105 FAIL: ZC UNALIGNED_INV_DESC_MULTI_BUFF
> > >
> > > # 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
> > > # Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
> > > XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
> > > 1..108
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 21 FAIL: SKB BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 33 FAIL: SKB BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 57 FAIL: DRV BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 69 FAIL: DRV BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 93 FAIL: ZC BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> > > # [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > ---------------------------------------
> > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > ---------------------------------------
> > > not ok 105 FAIL: ZC BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> > >
> > > # 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
> > > # Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
> > > XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]
> > >
> > > Summary:
> > > XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
> > > XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]
> >
> > Sorry, Maciej. I managed to get one server with i40e nic but still
> > couldn't reproduce it. Can you try the attachment (that is the
> > replacement for v4-0005) instead? I removed those nasty CONT test
> > cases...
> 
> Ah, I think I eventually figured out a solution. Maciej, could you
> please test the 2nd patch instead?
> 
> This patch reworks the CONTD test cases. Cross finger.

Please don't rush things here, I believe we need to think a bit more here.
I have second thoughts about overall approach.

My understanding wrt CQ was that it is a container that holds descriptors
which have been successfully transmitted. Now we want to add also leftover
descriptors from broken packets, which might confuse user space sides in
case they were relying on behavior described above.

The intent is right of course as we don't want to lose UMEM descs, but I
feel like we need a separate mechanism for that rather than putting
invalid descs to CQ.

Does it make sense?

Besides, even though we would stay with proposed changes, behavior between
modes should be aligned. Right now ZC seems to be broken in touched
regions here - when we hit the limit of frags via pool->xdp_zc_max_segs,
we break the loop and discard the packet, never post it to CQ and these
descs are lost from user space POV. Then we would continue on next call
and interpret the rest of too big packet as a separate one (clamped) and
therefore submit corrupted packet to HW.

I'll be looking at ZC API but i do think we need a common approach,
mode-agnostic.

Thanks,
Maciej

> 
> Thanks,
> Jason
> 
> >
> > Really I don't think I have much time to spend on these tests which
> > makes me feel extremely annoyed... It's not easy to analyze the code
> > without a reproducer. The good news is that now I highly suspect that
> > this kind of CONT test cases pollute the whole cq which affects other
> > tests. Before I give up on the 0003/0004 patches, I'd like to hear
> > some advice from you. Thank you.
> >
> > My original intention was to push batch xmit forward but at that time
> > sashiko pointed out some unrelated bugs accidentally.
> >
> > Thanks,
> > Jason
> >
> > >
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Jason Xing (5):
> > > > > > > >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > > > > > > >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > > > > > > >   xsk: drain continuation descs after overflow in xsk_build_skb()
> > > > > > > >   xsk: drain continuation descs on invalid descriptor in
> > > > > > > >     __xsk_generic_xmit()
> > > > > > > >   selftests/xsk: drain CQ to wait for TX completion
> > > > > > > >
> > > > > > > >  include/net/xdp_sock.h                        |  1 +
> > > > > > > >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> > > > > > > >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> > > > > > > >  3 files changed, 63 insertions(+), 30 deletions(-)
> > > > > > > >
> > > > > > > > --
> > > > > > > > 2.43.7
> > > > > > > >



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-22 18:33               ` Maciej Fijalkowski
@ 2026-05-22 23:49                 ` Jason Xing
  2026-05-26 19:43                   ` Maciej Fijalkowski
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Xing @ 2026-05-22 23:49 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Sat, May 23, 2026 at 2:34 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Fri, May 22, 2026 at 09:48:39PM +0800, Jason Xing wrote:
> > On Fri, May 22, 2026 at 4:55 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Thu, May 21, 2026 at 10:24 PM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Thu, May 21, 2026 at 09:07:30PM +0800, Jason Xing wrote:
> > > > > On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
> > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > >
> > > > > > On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> > > > > > > On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> > > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > > > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > > > > >
> > > > > > > > > The series is the product of previous review from sashiko[1].
> > > > > > > > >
> > > > > > > > > 1) META
> > > > > > > > > patch 1: address TOCTOU around metadata.
> > > > > > > > >
> > > > > > > > > 2) PUBLISH of CQ
> > > > > > > > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > > > > > > > >          overflow occurs.
> > > > > > > > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > > > > > > > >          publish its address when overflow occurs.
> > > > > > > > > patch 4: like patch 3, but only handles the invalid descs cases.
> > > > > > > > >
> > > > > > > > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > > V4
> > > > > > > > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > > > > > > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > > > > > > > 2. move set logic into xmit path in patch 3 (Stan)
> > > > > > > > >
> > > > > > > > > V3
> > > > > > > > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > > > > > > > 1. avoid breaking previous usage of sendto, and siliently handle
> > > > > > > > > overflow case (Stan, sashiko)
> > > > > > > > > 2. add one particular exception process in patch 4 (sashiko)
> > > > > > > > > 3. adjust the selftest to make sure it passes in either virutal or
> > > > > > > > > physical machines, which includes add usleep to support physical machine.
> > > > > > > > >
> > > > > > > > > V2
> > > > > > > > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > > > > > > > 1. adjust selftests (Jakub)
> > > > > > > > > 2. add READ_ONCE in patch 1 (Stan)
> > > > > > > >
> > > > > > > > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> > > > > > >
> > > > > > > Thanks for the test. But I've tried with ixgbe driver...
> > > > > > >
> > > > > > > I noticed there are some flaky tests which have nothing to do with the
> > > > > > > series. Can you confirm that it's not caused because of the series?
> > > > > >
> > > > > > That explains the different results as i am using i40e/ice which have
> > > > > > multi-buffer support whereas ixgbe does not even support mbuf at XDP.
> > > > > > Broken tests are from mbuf cases.
> > > > >
> > > > > That's weird. I never expected the failed tests to be about multi-buffer.
> > > > >
> > > > > Are they the same as the output you attached last time? Or something
> > > > > new? Could you please share it so that I can investigate the root
> > > > > cause?
> > > >
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 21 FAIL: SKB ALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 33 FAIL: SKB UNALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 57 FAIL: DRV ALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 69 FAIL: DRV UNALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 93 FAIL: ZC ALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > ---------------------------------------
> > > > not ok 94 FAIL: ZC TOO_MANY_FRAGS
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 105 FAIL: ZC UNALIGNED_INV_DESC_MULTI_BUFF
> > > >
> > > > # 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
> > > > # Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
> > > > XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
> > > > 1..108
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 21 FAIL: SKB BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 33 FAIL: SKB BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 57 FAIL: DRV BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 69 FAIL: DRV BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 93 FAIL: ZC BUSY-POLL ALIGNED_INV_DESC_MULTI_BUFF
> > > > # [is_frag_valid] expected pkt_nb [11], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > ---------------------------------------
> > > > # [is_frag_valid] expected pkt_nb [10], got pkt_nb [0]
> > > > # DEBUG>> L2: dst mac: # 55# 44# 33# 22# 11# 01#
> > > > DEBUG>> L2: src mac: # 55# 44# 33# 22# 11# 00#
> > > > DEBUG>> L5: seqnum: # 0:0 # 0:1 # 0:2 # 0:3 # 0:4 # 0:5 # 0:6 # 0:7 # 0:8 # 0:9 # 0:10 # 0:11 # 0:0 # 0:0 # 0:0 # 0:0 # ....#
> > > > .... # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 # 0:0 #
> > > > ---------------------------------------
> > > > not ok 105 FAIL: ZC BUSY-POLL UNALIGNED_INV_DESC_MULTI_BUFF
> > > >
> > > > # 4 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
> > > > # Totals: pass:96 fail:8 xfail:0 xpass:0 skip:4 error:0
> > > > XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]
> > > >
> > > > Summary:
> > > > XSK_SELFTESTS_ens259f1np1_SOFTIRQ: [ FAIL ]
> > > > XSK_SELFTESTS_ens259f1np1_BUSY_POLL: [ FAIL ]
> > >
> > > Sorry, Maciej. I managed to get one server with i40e nic but still
> > > couldn't reproduce it. Can you try the attachment (that is the
> > > replacement for v4-0005) instead? I removed those nasty CONT test
> > > cases...
> >
> > Ah, I think I eventually figured out a solution. Maciej, could you
> > please test the 2nd patch instead?
> >
> > This patch reworks the CONTD test cases. Cross finger.
>
> Please don't rush things here, I believe we need to think a bit more here.
> I have second thoughts about overall approach.
>
> My understanding wrt CQ was that it is a container that holds descriptors
> which have been successfully transmitted. Now we want to add also leftover
> descriptors from broken packets, which might confuse user space sides in
> case they were relying on behavior described above.
>
> The intent is right of course as we don't want to lose UMEM descs, but I
> feel like we need a separate mechanism for that rather than putting
> invalid descs to CQ.

I don't sense anything strange here if we stick to put those
invalid/overflowed descriptors into cq. AF_XDP is only a tunnel that
transfers the data. That's it. A bit like how the physical link works,
which means it possibly drops data because of congestion.

Upper protocol is used to guarantee when to (re)transmit a packet -
the mechanism is the ACK driven in terms of TCP. TCP is absolutely
capable of finding such an abnormal thing happening by checking the
seq of incoming ack. My takeaway from this is we don't need to
deliberately design new stuff to fulfill direct and immediate
communication.

CQ works somehow as a notification that tells user space whether the
kernel receives the data from the app and handles them. Without
putting them in the CQ, the only thing for userspace to do is simply
wait.

With that said, IMHO, I cannot figure out why we need a separate queue
or something like that. Of course, a new notification that handles all
the possible/potential exceptions and contributes to the performance
of the upper layer is worth a try :) The latter is crucial.

>
> Does it make sense?
>
> Besides, even though we would stay with proposed changes, behavior between
> modes should be aligned. Right now ZC seems to be broken in touched
> regions here - when we hit the limit of frags via pool->xdp_zc_max_segs,
> we break the loop and discard the packet, never post it to CQ and these
> descs are lost from user space POV. Then we would continue on next call
> and interpret the rest of too big packet as a separate one (clamped) and
> therefore submit corrupted packet to HW.

Right, this is how the previous selftests changes pollute the
subsequent tests after that. I think the new version of the attachment
should pass all the tests since I put all the CONTD tests separately
into another two functions? It's pointless to test those in the zc
mode.

As to the series, if no objections or any suggestions jump into the
thread, I'll post the series within a week.

Thanks,
Jason

>
> I'll be looking at ZC API but i do think we need a common approach,
> mode-agnostic.
>
> Thanks,
> Maciej
>
> >
> > Thanks,
> > Jason
> >
> > >
> > > Really I don't think I have much time to spend on these tests which
> > > makes me feel extremely annoyed... It's not easy to analyze the code
> > > without a reproducer. The good news is that now I highly suspect that
> > > this kind of CONT test cases pollute the whole cq which affects other
> > > tests. Before I give up on the 0003/0004 patches, I'd like to hear
> > > some advice from you. Thank you.
> > >
> > > My original intention was to push batch xmit forward but at that time
> > > sashiko pointed out some unrelated bugs accidentally.
> > >
> > > Thanks,
> > > Jason
> > >
> > > >
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Jason Xing (5):
> > > > > > > > >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > > > > > > > >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > > > > > > > >   xsk: drain continuation descs after overflow in xsk_build_skb()
> > > > > > > > >   xsk: drain continuation descs on invalid descriptor in
> > > > > > > > >     __xsk_generic_xmit()
> > > > > > > > >   selftests/xsk: drain CQ to wait for TX completion
> > > > > > > > >
> > > > > > > > >  include/net/xdp_sock.h                        |  1 +
> > > > > > > > >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> > > > > > > > >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> > > > > > > > >  3 files changed, 63 insertions(+), 30 deletions(-)
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > 2.43.7
> > > > > > > > >
>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-22 23:49                 ` Jason Xing
@ 2026-05-26 19:43                   ` Maciej Fijalkowski
  2026-05-26 23:26                     ` Jason Xing
  0 siblings, 1 reply; 25+ messages in thread
From: Maciej Fijalkowski @ 2026-05-26 19:43 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Sat, May 23, 2026 at 07:49:00AM +0800, Jason Xing wrote:
> On Sat, May 23, 2026 at 2:34 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Fri, May 22, 2026 at 09:48:39PM +0800, Jason Xing wrote:
> > > On Fri, May 22, 2026 at 4:55 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Thu, May 21, 2026 at 10:24 PM Maciej Fijalkowski
> > > > <maciej.fijalkowski@intel.com> wrote:
> > > > >
> > > > > On Thu, May 21, 2026 at 09:07:30PM +0800, Jason Xing wrote:
> > > > > > On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
> > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > >
> > > > > > > On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> > > > > > > > On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> > > > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > > > > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > > > > > >
> > > > > > > > > > The series is the product of previous review from sashiko[1].
> > > > > > > > > >
> > > > > > > > > > 1) META
> > > > > > > > > > patch 1: address TOCTOU around metadata.
> > > > > > > > > >
> > > > > > > > > > 2) PUBLISH of CQ
> > > > > > > > > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > > > > > > > > >          overflow occurs.
> > > > > > > > > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > > > > > > > > >          publish its address when overflow occurs.
> > > > > > > > > > patch 4: like patch 3, but only handles the invalid descs cases.
> > > > > > > > > >
> > > > > > > > > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > > V4
> > > > > > > > > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > > > > > > > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > > > > > > > > 2. move set logic into xmit path in patch 3 (Stan)
> > > > > > > > > >
> > > > > > > > > > V3
> > > > > > > > > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > > > > > > > > 1. avoid breaking previous usage of sendto, and siliently handle
> > > > > > > > > > overflow case (Stan, sashiko)
> > > > > > > > > > 2. add one particular exception process in patch 4 (sashiko)
> > > > > > > > > > 3. adjust the selftest to make sure it passes in either virutal or
> > > > > > > > > > physical machines, which includes add usleep to support physical machine.
> > > > > > > > > >
> > > > > > > > > > V2
> > > > > > > > > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > > > > > > > > 1. adjust selftests (Jakub)
> > > > > > > > > > 2. add READ_ONCE in patch 1 (Stan)
> > > > > > > > >
> > > > > > > > > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> > > > > > > >
> > > > > > > > Thanks for the test. But I've tried with ixgbe driver...
> > > > > > > >
> > > > > > > > I noticed there are some flaky tests which have nothing to do with the
> > > > > > > > series. Can you confirm that it's not caused because of the series?
> > > > > > >
> > > > > > > That explains the different results as i am using i40e/ice which have
> > > > > > > multi-buffer support whereas ixgbe does not even support mbuf at XDP.
> > > > > > > Broken tests are from mbuf cases.
> > > > > >
> > > > > > That's weird. I never expected the failed tests to be about multi-buffer.
> > > > > >
> > > > > > Are they the same as the output you attached last time? Or something
> > > > > > new? Could you please share it so that I can investigate the root
> > > > > > cause?

[...]

> > > >
> > > > Sorry, Maciej. I managed to get one server with i40e nic but still
> > > > couldn't reproduce it. Can you try the attachment (that is the
> > > > replacement for v4-0005) instead? I removed those nasty CONT test
> > > > cases...
> > >
> > > Ah, I think I eventually figured out a solution. Maciej, could you
> > > please test the 2nd patch instead?
> > >
> > > This patch reworks the CONTD test cases. Cross finger.
> >
> > Please don't rush things here, I believe we need to think a bit more here.
> > I have second thoughts about overall approach.
> >
> > My understanding wrt CQ was that it is a container that holds descriptors
> > which have been successfully transmitted. Now we want to add also leftover
> > descriptors from broken packets, which might confuse user space sides in
> > case they were relying on behavior described above.
> >
> > The intent is right of course as we don't want to lose UMEM descs, but I
> > feel like we need a separate mechanism for that rather than putting
> > invalid descs to CQ.
> 
> I don't sense anything strange here if we stick to put those
> invalid/overflowed descriptors into cq. AF_XDP is only a tunnel that
> transfers the data. That's it. A bit like how the physical link works,
> which means it possibly drops data because of congestion.
> 
> Upper protocol is used to guarantee when to (re)transmit a packet -
> the mechanism is the ACK driven in terms of TCP. TCP is absolutely
> capable of finding such an abnormal thing happening by checking the
> seq of incoming ack. My takeaway from this is we don't need to
> deliberately design new stuff to fulfill direct and immediate
> communication.

AF_XDP is often used for L2/L3 forwarding, UDP, custom transports, so I
was afraid some existing solutions might be relying on CQ entries implying
successful Tx.

Generally this issue is highly unlikely yet a thing we need to address so
let's follow your approach, but for that we need to update documentation
and align ZC side so that we would not have to deviate test cases.

> 
> CQ works somehow as a notification that tells user space whether the
> kernel receives the data from the app and handles them. Without
> putting them in the CQ, the only thing for userspace to do is simply
> wait.

I don't follow the last part of the sentence but let's disregard it.
Userspace gets errno/retcode in generic xmit so it is aware of underlying
issues and then it's app job to act upon it.

> 
> With that said, IMHO, I cannot figure out why we need a separate queue
> or something like that. Of course, a new notification that handles all
> the possible/potential exceptions and contributes to the performance
> of the upper layer is worth a try :) The latter is crucial.

We discussed with Magnus it would be good to have a dedicated xsk_queue
stat for that case, such as 'oversized_descs' which would be bumped by the
amount of descs produced to CQ.

> 
> >
> > Does it make sense?
> >
> > Besides, even though we would stay with proposed changes, behavior between
> > modes should be aligned. Right now ZC seems to be broken in touched
> > regions here - when we hit the limit of frags via pool->xdp_zc_max_segs,
> > we break the loop and discard the packet, never post it to CQ and these
> > descs are lost from user space POV. Then we would continue on next call
> > and interpret the rest of too big packet as a separate one (clamped) and
> > therefore submit corrupted packet to HW.
> 
> Right, this is how the previous selftests changes pollute the
> subsequent tests after that. I think the new version of the attachment
> should pass all the tests since I put all the CONTD tests separately
> into another two functions? It's pointless to test those in the zc
> mode.

We need analogous fix on ZC, then no such quirks in tests should exist.
Test is doing the same thing regardless of underlying mode. Only the range
differ (MAX_SKB_FRAGS vs pool->xdp_zc_max_segs).

To wrap up, I see it like this, moving forward:
1. fix docs
2. add ss stat
3. wait for me with ZC fixes (I'm slow!)
4. inspect if tests will fly

Let me know your thoughts! Maybe Stan wants to chime in?
Maciej

> 
> As to the series, if no objections or any suggestions jump into the
> thread, I'll post the series within a week.
> 
> Thanks,
> Jason
> 
> >
> > I'll be looking at ZC API but i do think we need a common approach,
> > mode-agnostic.
> >
> > Thanks,
> > Maciej
> >
> > >
> > > Thanks,
> > > Jason
> > >
> > > >
> > > > Really I don't think I have much time to spend on these tests which
> > > > makes me feel extremely annoyed... It's not easy to analyze the code
> > > > without a reproducer. The good news is that now I highly suspect that
> > > > this kind of CONT test cases pollute the whole cq which affects other
> > > > tests. Before I give up on the 0003/0004 patches, I'd like to hear
> > > > some advice from you. Thank you.
> > > >
> > > > My original intention was to push batch xmit forward but at that time
> > > > sashiko pointed out some unrelated bugs accidentally.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Jason Xing (5):
> > > > > > > > > >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > > > > > > > > >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > > > > > > > > >   xsk: drain continuation descs after overflow in xsk_build_skb()
> > > > > > > > > >   xsk: drain continuation descs on invalid descriptor in
> > > > > > > > > >     __xsk_generic_xmit()
> > > > > > > > > >   selftests/xsk: drain CQ to wait for TX completion
> > > > > > > > > >
> > > > > > > > > >  include/net/xdp_sock.h                        |  1 +
> > > > > > > > > >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> > > > > > > > > >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> > > > > > > > > >  3 files changed, 63 insertions(+), 30 deletions(-)
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > 2.43.7
> > > > > > > > > >
> >
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
  2026-05-26 19:43                   ` Maciej Fijalkowski
@ 2026-05-26 23:26                     ` Jason Xing
  0 siblings, 0 replies; 25+ messages in thread
From: Jason Xing @ 2026-05-26 23:26 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
	andrew+netdev, bpf, netdev, Jason Xing

On Wed, May 27, 2026 at 3:43 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Sat, May 23, 2026 at 07:49:00AM +0800, Jason Xing wrote:
> > On Sat, May 23, 2026 at 2:34 AM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Fri, May 22, 2026 at 09:48:39PM +0800, Jason Xing wrote:
> > > > On Fri, May 22, 2026 at 4:55 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > On Thu, May 21, 2026 at 10:24 PM Maciej Fijalkowski
> > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > >
> > > > > > On Thu, May 21, 2026 at 09:07:30PM +0800, Jason Xing wrote:
> > > > > > > On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
> > > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> > > > > > > > > On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> > > > > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > > > > > > >
> > > > > > > > > > > The series is the product of previous review from sashiko[1].
> > > > > > > > > > >
> > > > > > > > > > > 1) META
> > > > > > > > > > > patch 1: address TOCTOU around metadata.
> > > > > > > > > > >
> > > > > > > > > > > 2) PUBLISH of CQ
> > > > > > > > > > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > > > > > > > > > >          overflow occurs.
> > > > > > > > > > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > > > > > > > > > >          publish its address when overflow occurs.
> > > > > > > > > > > patch 4: like patch 3, but only handles the invalid descs cases.
> > > > > > > > > > >
> > > > > > > > > > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > > > > > > > > > >
> > > > > > > > > > > ---
> > > > > > > > > > > V4
> > > > > > > > > > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > > > > > > > > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > > > > > > > > > 2. move set logic into xmit path in patch 3 (Stan)
> > > > > > > > > > >
> > > > > > > > > > > V3
> > > > > > > > > > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > > > > > > > > > 1. avoid breaking previous usage of sendto, and siliently handle
> > > > > > > > > > > overflow case (Stan, sashiko)
> > > > > > > > > > > 2. add one particular exception process in patch 4 (sashiko)
> > > > > > > > > > > 3. adjust the selftest to make sure it passes in either virutal or
> > > > > > > > > > > physical machines, which includes add usleep to support physical machine.
> > > > > > > > > > >
> > > > > > > > > > > V2
> > > > > > > > > > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > > > > > > > > > 1. adjust selftests (Jakub)
> > > > > > > > > > > 2. add READ_ONCE in patch 1 (Stan)
> > > > > > > > > >
> > > > > > > > > > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> > > > > > > > >
> > > > > > > > > Thanks for the test. But I've tried with ixgbe driver...
> > > > > > > > >
> > > > > > > > > I noticed there are some flaky tests which have nothing to do with the
> > > > > > > > > series. Can you confirm that it's not caused because of the series?
> > > > > > > >
> > > > > > > > That explains the different results as i am using i40e/ice which have
> > > > > > > > multi-buffer support whereas ixgbe does not even support mbuf at XDP.
> > > > > > > > Broken tests are from mbuf cases.
> > > > > > >
> > > > > > > That's weird. I never expected the failed tests to be about multi-buffer.
> > > > > > >
> > > > > > > Are they the same as the output you attached last time? Or something
> > > > > > > new? Could you please share it so that I can investigate the root
> > > > > > > cause?
>
> [...]
>
> > > > >
> > > > > Sorry, Maciej. I managed to get one server with i40e nic but still
> > > > > couldn't reproduce it. Can you try the attachment (that is the
> > > > > replacement for v4-0005) instead? I removed those nasty CONT test
> > > > > cases...
> > > >
> > > > Ah, I think I eventually figured out a solution. Maciej, could you
> > > > please test the 2nd patch instead?
> > > >
> > > > This patch reworks the CONTD test cases. Cross finger.
> > >
> > > Please don't rush things here, I believe we need to think a bit more here.
> > > I have second thoughts about overall approach.
> > >
> > > My understanding wrt CQ was that it is a container that holds descriptors
> > > which have been successfully transmitted. Now we want to add also leftover
> > > descriptors from broken packets, which might confuse user space sides in
> > > case they were relying on behavior described above.
> > >
> > > The intent is right of course as we don't want to lose UMEM descs, but I
> > > feel like we need a separate mechanism for that rather than putting
> > > invalid descs to CQ.
> >
> > I don't sense anything strange here if we stick to put those
> > invalid/overflowed descriptors into cq. AF_XDP is only a tunnel that
> > transfers the data. That's it. A bit like how the physical link works,
> > which means it possibly drops data because of congestion.
> >
> > Upper protocol is used to guarantee when to (re)transmit a packet -
> > the mechanism is the ACK driven in terms of TCP. TCP is absolutely
> > capable of finding such an abnormal thing happening by checking the
> > seq of incoming ack. My takeaway from this is we don't need to
> > deliberately design new stuff to fulfill direct and immediate
> > communication.
>
> AF_XDP is often used for L2/L3 forwarding, UDP, custom transports, so I
> was afraid some existing solutions might be relying on CQ entries implying
> successful Tx.
>
> Generally this issue is highly unlikely yet a thing we need to address so
> let's follow your approach, but for that we need to update documentation
> and align ZC side so that we would not have to deviate test cases.

Sorry, I don't see the point why we need to align the ZC side rather
than skipping them in ZC mode simply?

>
> >
> > CQ works somehow as a notification that tells user space whether the
> > kernel receives the data from the app and handles them. Without
> > putting them in the CQ, the only thing for userspace to do is simply
> > wait.
>
> I don't follow the last part of the sentence but let's disregard it.
> Userspace gets errno/retcode in generic xmit so it is aware of underlying
> issues and then it's app job to act upon it.

Right.

>
> >
> > With that said, IMHO, I cannot figure out why we need a separate queue
> > or something like that. Of course, a new notification that handles all
> > the possible/potential exceptions and contributes to the performance
> > of the upper layer is worth a try :) The latter is crucial.
>
> We discussed with Magnus it would be good to have a dedicated xsk_queue
> stat for that case, such as 'oversized_descs' which would be bumped by the
> amount of descs produced to CQ.

That is surely necessary, which was actually put into my todo list as
well as skb drop reason in the xmit path. But it's a net-next/bpf-next
material.

>
> >
> > >
> > > Does it make sense?
> > >
> > > Besides, even though we would stay with proposed changes, behavior between
> > > modes should be aligned. Right now ZC seems to be broken in touched
> > > regions here - when we hit the limit of frags via pool->xdp_zc_max_segs,
> > > we break the loop and discard the packet, never post it to CQ and these
> > > descs are lost from user space POV. Then we would continue on next call
> > > and interpret the rest of too big packet as a separate one (clamped) and
> > > therefore submit corrupted packet to HW.
> >
> > Right, this is how the previous selftests changes pollute the
> > subsequent tests after that. I think the new version of the attachment
> > should pass all the tests since I put all the CONTD tests separately
> > into another two functions? It's pointless to test those in the zc
> > mode.
>
> We need analogous fix on ZC, then no such quirks in tests should exist.

I don't regard it as a quirk honestly. I'm curious what is the
advantage of unifying it just for this case? ZC mode doesn't need to
test the overflowed skb that appears in the __xsk_generic_xmit path,
right?

> Test is doing the same thing regardless of underlying mode. Only the range
> differ (MAX_SKB_FRAGS vs pool->xdp_zc_max_segs).
>
> To wrap up, I see it like this, moving forward:
> 1. fix docs
> 2. add ss stat
> 3. wait for me with ZC fixes (I'm slow!)
> 4. inspect if tests will fly
>
> Let me know your thoughts! Maybe Stan wants to chime in?

Roughly I agree with the points if you insist on working on ZC fixes.
I can wait for you :)

My plan is as follows:
1. only submit the 0001 patch only in v5
2. wait for your selftests fixes and then submit that series
3. move on with batch xmit since I don't expect this series to be a
blocking point

Thanks,
Jason

> Maciej
>
> >
> > As to the series, if no objections or any suggestions jump into the
> > thread, I'll post the series within a week.
> >
> > Thanks,
> > Jason
> >
> > >
> > > I'll be looking at ZC API but i do think we need a common approach,
> > > mode-agnostic.
> > >
> > > Thanks,
> > > Maciej
> > >
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > >
> > > > > Really I don't think I have much time to spend on these tests which
> > > > > makes me feel extremely annoyed... It's not easy to analyze the code
> > > > > without a reproducer. The good news is that now I highly suspect that
> > > > > this kind of CONT test cases pollute the whole cq which affects other
> > > > > tests. Before I give up on the 0003/0004 patches, I'd like to hear
> > > > > some advice from you. Thank you.
> > > > >
> > > > > My original intention was to push batch xmit forward but at that time
> > > > > sashiko pointed out some unrelated bugs accidentally.
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jason
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Jason Xing (5):
> > > > > > > > > > >   xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > > > > > > > > > >   xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > > > > > > > > > >   xsk: drain continuation descs after overflow in xsk_build_skb()
> > > > > > > > > > >   xsk: drain continuation descs on invalid descriptor in
> > > > > > > > > > >     __xsk_generic_xmit()
> > > > > > > > > > >   selftests/xsk: drain CQ to wait for TX completion
> > > > > > > > > > >
> > > > > > > > > > >  include/net/xdp_sock.h                        |  1 +
> > > > > > > > > > >  net/xdp/xsk.c                                 | 44 +++++++++++++----
> > > > > > > > > > >  .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
> > > > > > > > > > >  3 files changed, 63 insertions(+), 30 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > 2.43.7
> > > > > > > > > > >
> > >
> > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2026-05-26 23:27 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-20  0:42 [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Jason Xing
2026-05-20  0:42 ` [PATCH net v4 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
2026-05-21 12:04   ` Maciej Fijalkowski
2026-05-20  0:42 ` [PATCH net v4 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
2026-05-21 12:05   ` Maciej Fijalkowski
2026-05-20  0:42 ` [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
2026-05-20 16:10   ` Maciej Fijalkowski
2026-05-20 23:53     ` Jason Xing
2026-05-21 12:02       ` Maciej Fijalkowski
2026-05-21 13:10         ` Jason Xing
2026-05-22  9:06           ` Magnus Karlsson
2026-05-22  9:22             ` Jason Xing
2026-05-20  0:42 ` [PATCH net v4 4/5] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
2026-05-20  0:42 ` [PATCH net v4 5/5] selftests/xsk: drain CQ to wait for TX completion Jason Xing
2026-05-21 12:23 ` [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Maciej Fijalkowski
2026-05-21 12:41   ` Jason Xing
2026-05-21 12:59     ` Maciej Fijalkowski
2026-05-21 13:07       ` Jason Xing
2026-05-21 14:24         ` Maciej Fijalkowski
2026-05-22  8:55           ` Jason Xing
2026-05-22 13:48             ` Jason Xing
2026-05-22 18:33               ` Maciej Fijalkowski
2026-05-22 23:49                 ` Jason Xing
2026-05-26 19:43                   ` Maciej Fijalkowski
2026-05-26 23:26                     ` Jason Xing

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox