* [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation
@ 2026-05-02 20:07 Jason Xing
2026-05-02 20:07 ` [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
` (9 more replies)
0 siblings, 10 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
There are a few rare bugs around xsk_build_skb(). Some of them
were found by Sashiko[1][2].
[1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
[2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/
---
v5
Link: https://lore.kernel.org/all/20260424053816.27965-1-kerneljasonxing@gmail.com/
1. reword patch 3
2. adjust the series order: former patch 6 is now patch 2
3. use helper in patch (Stan)
v4
Link: https://lore.kernel.org/all/20260422033650.68457-1-kerneljasonxing@gmail.com/
1. fix 32-bit arch issue in patch 8 (Alexander and Maciej)
2. add acked-by tags (Stan)
v3
Link: https://lore.kernel.org/all/20260420082805.14844-1-kerneljasonxing@gmail.com/
1. use !xs->skb as the indicator of first frag in patch 3 (Stan)
2. move the skb init to after successful allocation; patches 4, 5 and 6
have been adjusted accordingly (Stan)
3. do not support 32-bit arch (Stan)
4. add acked-by tags (Stan)
v2
Link: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/#t
1. add four patches to fix pre-existing buggy behavior spotted by
Sashiko
2. adjust the order of 8 patches.
Jason Xing (8):
xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
xsk: handle NULL dereference of the skb without frags issue
xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
xsk: avoid skb leak in XDP_TX_METADATA case
xsk: fix xsk_addrs slab leak on multi-buffer error path
xsk: fix u64 descriptor address truncation on 32-bit architectures
net/xdp/xsk.c | 115 ++++++++++++++++++++++++++--------------
net/xdp/xsk_buff_pool.c | 3 ++
2 files changed, 77 insertions(+), 41 deletions(-)
--
2.41.3
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-02 20:07 ` [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
` (8 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
skb_checksum_help() is a common helper that writes the folded
16-bit checksum back via skb->data + csum_start + csum_offset,
i.e. it relies on the skb's linear head and fails (with WARN_ONCE
and -EINVAL) when skb_headlen() is 0.
AF_XDP generic xmit takes two very different paths depending on the
netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
skip the "copy payload into a linear head" step on purpose as a
performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
pages as frags and never calls skb_put(), so skb_headlen() stays 0
for the whole skb. For these skbs there is simply no linear area for
skb_checksum_help() to write the csum into - the sw-csum fallback is
structurally inapplicable.
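The structural mismatch can be sketched in a userspace toy model (this
is not kernel code: toy_skb, toy_checksum_help and the field layout are
invented stand-ins, but the head-length guard mirrors the one described
above):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model, not kernel code: an skb with a linear head of head_len
 * bytes; the checksum slot sits at csum_start + csum_offset. For
 * IFF_TX_SKB_NO_LINEAR skbs built by xsk_build_skb_zerocopy(),
 * head_len stays 0, so the slot can never fit. */
struct toy_skb {
	size_t head_len;
	size_t csum_start;
	size_t csum_offset;
};

/* Mirrors the guard in skb_checksum_help(): the folded 16-bit csum
 * can only be written back inside the linear area. */
int toy_checksum_help(const struct toy_skb *skb)
{
	size_t offset = skb->csum_start + skb->csum_offset;

	if (offset + sizeof(unsigned short) > skb->head_len)
		return -EINVAL;	/* the kernel also fires WARN_ONCE here */
	return 0;
}
```

With head_len 0 every call fails, no matter what csum_start/csum_offset
userspace supplies, which is why the fallback is rejected at bind().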
Catch this and reject the combination at bind() time, converting the
silent per-packet failure into a synchronous, actionable -EOPNOTSUPP.
HW csum and
launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
because they do not call skb_checksum_help().
Without the patch, every descriptor carrying 'XDP_TX_METADATA |
XDP_TXMD_FLAGS_CHECKSUM' produces:
1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
2) sendmsg() returning -EINVAL without consuming the descriptor
(invalid_descs is not incremented),
3) a wedged TX ring: __xsk_generic_xmit() does not advance the
consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
the same descriptor and re-hits the same WARN until the socket
is closed.
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk_buff_pool.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 37b7a68b89b3..c2521b6547e3 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
if (force_zc && force_copy)
return -EINVAL;
+ if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
+ return -EOPNOTSUPP;
+
if (xsk_get_pool_from_qid(netdev, queue_id))
return -EBUSY;
--
2.41.3
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-05-02 20:07 ` [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-02 20:07 ` [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
` (7 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
The skb allocated in xsk_build_skb_zerocopy() leaks when the
MAX_SKB_FRAGS limit is hit on the first frag. Fix it by explicitly
calling kfree_skb() before returning to the caller.
How to reproduce it in virtio_net:
1. the current skb is the first one (which means xs->skb is NULL) and
hits the MAX_SKB_FRAGS limit.
2. xsk_build_skb_zerocopy() returns -EOVERFLOW.
3. the caller xsk_build_skb() clears its local pointer with
'skb = NULL;'. This is why the bug can be triggered.
4. there is no chance to free this skb anymore.
Note that if in this case the xs->skb is not NULL, xsk_build_skb() will
call xsk_drop_skb(xs->skb) to do the right thing.
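A minimal userspace sketch of the ownership rule (the toy_* names are
invented; this only illustrates the fix's logic, not the kernel code):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Toy model, not kernel code: ownership on the -EOVERFLOW path.
 * alive counts toy skbs that are still allocated. */
int alive;

void *toy_alloc_skb(void) { alive++; return malloc(1); }
void toy_kfree_skb(void *skb) { alive--; free(skb); }

/* xs_skb is the caller's in-progress multi-buffer skb; it is NULL
 * when skb was freshly allocated for a first frag, so nobody else
 * can free skb once this function returns the error. */
int toy_overflow_path(void *xs_skb, void *skb)
{
	if (!xs_skb)
		toy_kfree_skb(skb);	/* the fix: free it right here */
	return -EOVERFLOW;		/* caller sets its local skb = NULL */
}
```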
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6149f6a79897..46660943c81e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -776,8 +776,11 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
addr = buffer - pool->addrs;
for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) {
- if (unlikely(i >= MAX_SKB_FRAGS))
+ if (unlikely(i >= MAX_SKB_FRAGS)) {
+ if (!xs->skb)
+ kfree_skb(skb);
return ERR_PTR(-EOVERFLOW);
+ }
page = pool->umem->pgs[addr >> PAGE_SHIFT];
get_page(page);
--
2.41.3
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-05-02 20:07 ` [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
2026-05-02 20:07 ` [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-02 20:07 ` [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
` (6 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
When a first descriptor (xs->skb == NULL) triggers -EOVERFLOW in
xsk_build_skb_zerocopy() (e.g., MAX_SKB_FRAGS exceeded), the
free_err -EOVERFLOW handler unconditionally dereferences xs->skb
via xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb), causing
a NULL pointer dereference.
Fix this by guarding the existing xsk_inc_num_desc()/xsk_drop_skb()
calls with an xs->skb check (for the continuation case), and adding
an else branch for the first-descriptor case that manually cancels
the one reserved CQ slot and increments invalid_descs to account
for the single invalid descriptor.
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 46660943c81e..ff1eade29aa6 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -896,9 +896,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
kfree_skb(skb);
if (err == -EOVERFLOW) {
- /* Drop the packet */
- xsk_inc_num_desc(xs->skb);
- xsk_drop_skb(xs->skb);
+ if (xs->skb) {
+ /* Drop the packet */
+ xsk_inc_num_desc(xs->skb);
+ xsk_drop_skb(xs->skb);
+ } else {
+ xsk_cq_cancel_locked(xs->pool, 1);
+ xs->tx->invalid_descs++;
+ }
xskq_cons_release(xs->tx);
} else {
/* Let application retry */
--
2.41.3
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (2 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-02 20:07 ` [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
` (5 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
When xsk_build_skb() processes multi-buffer packets in copy mode, the
first descriptor stores data into the skb linear area without adding
any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
to accumulate subsequent descriptors.
If a continuation descriptor fails (e.g. alloc_page returns NULL with
-EAGAIN), we jump to free_err where the condition:
if (skb && !skb_shinfo(skb)->nr_frags)
kfree_skb(skb);
evaluates to true because nr_frags is still 0 (the first descriptor
used the linear area, not frags). This frees the skb while xs->skb
still points to it, creating a dangling pointer. On the next transmit
attempt or socket close, xs->skb is dereferenced, causing a
use-after-free or double-free.
Fix this by using a !xs->skb check to detect the first-frag case,
ensuring we only free skbs that were freshly allocated in this call
(xs->skb is NULL) and never an in-progress multi-buffer skb that
the caller still references.
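The difference between the two discriminators can be sketched in
userspace (toy model only; the names are invented, not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: which skbs may the error path free?
 * A linear-only first descriptor leaves nr_frags at 0 even though
 * xs->skb already owns the skb, so nr_frags is the wrong test. */
bool buggy_may_free(void *skb, unsigned int nr_frags, void *xs_skb)
{
	(void)xs_skb;
	return skb && nr_frags == 0;	/* frees an skb the caller still holds */
}

bool fixed_may_free(void *skb, unsigned int nr_frags, void *xs_skb)
{
	(void)nr_frags;
	return skb && !xs_skb;	/* only skbs allocated in this very call */
}
```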
Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ff1eade29aa6..ae59d1c1d2f8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -892,7 +892,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
return skb;
free_err:
- if (skb && !skb_shinfo(skb)->nr_frags)
+ if (skb && !xs->skb)
kfree_skb(skb);
if (err == -EOVERFLOW) {
--
2.41.3
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (3 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-02 20:07 ` [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
` (4 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
Once xsk_skb_init_misc() has been called on an skb, its destructor is
set to xsk_destruct_skb(), which submits the descriptor address(es) to
the completion queue and advances the CQ producer. If such an skb is
subsequently freed via kfree_skb() along an error path - before the
skb has ever been handed to the driver - the destructor still runs and
submits a bogus, half-initialized address to the CQ.
Postpone the init phase until the first frag has been set up
successfully. Before this init, the skb can be safely freed by
kfree_skb().
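The ordering constraint can be illustrated with a userspace toy model
(invented names, not the kernel code): a destructor that publishes CQ
entries must only be installed once no error path can still free the
skb.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not kernel code. cq_entries counts completions that
 * would become visible to userspace. */
int cq_entries;

struct toy_skb {
	void (*destructor)(struct toy_skb *);
};

void toy_destruct(struct toy_skb *skb)
{
	(void)skb;
	cq_entries++;	/* stands in for xsk_cq_submit_addr_locked() */
}

/* Like kfree_skb(): runs the destructor if one was installed. */
void toy_kfree_skb(struct toy_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
}
```

Freeing a half-built skb before the destructor is installed produces no
completion; freeing it afterwards would publish a bogus one.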
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ae59d1c1d2f8..13bacc11fa9d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -739,8 +739,6 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
return ERR_PTR(err);
skb_reserve(skb, hr);
-
- xsk_skb_init_misc(skb, xs, desc->addr);
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
if (unlikely(err))
@@ -837,7 +835,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
if (unlikely(err))
goto free_err;
- xsk_skb_init_misc(skb, xs, desc->addr);
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc,
xs->pool, hr);
@@ -887,6 +884,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
}
}
+ if (!xs->skb)
+ xsk_skb_init_misc(skb, xs, desc->addr);
xsk_inc_num_desc(skb);
return skb;
--
2.41.3
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (4 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-02 20:07 ` [PATCH net v5 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
` (3 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
Fix the skb leak by explicitly calling kfree_skb() before returning
to the caller.
How to reproduce it in virtio_net:
1. the current skb is the first one (which means no frag and xs->skb is
NULL) and the metadata feature is enabled.
2. xsk_skb_metadata() returns an error code.
3. the caller xsk_build_skb() clears its local pointer with
'skb = NULL;'.
4. there is no chance to free this skb anymore.
Closes: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 13bacc11fa9d..5e6326e076ab 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -741,8 +741,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
skb_reserve(skb, hr);
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
- if (unlikely(err))
+ if (unlikely(err)) {
+ kfree_skb(skb);
return ERR_PTR(err);
+ }
}
} else {
struct xsk_addrs *xsk_addr;
--
2.41.3
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH net v5 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (5 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
` (2 subsequent siblings)
9 siblings, 0 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
When xsk_build_skb() / xsk_build_skb_zerocopy() sees the first
continuation descriptor, it promotes destructor_arg from an inlined
address to a freshly allocated xsk_addrs (num_descs = 1). The counter
is bumped to >= 2 only at the very end of a successful build (by calling
xsk_inc_num_desc()).
If the build fails in between (e.g. alloc_page() returns NULL with
-EAGAIN, or the MAX_SKB_FRAGS overflow hits), we jump to free_err, skip
calling xsk_inc_num_desc() to increment num_descs and leave the half-built
skb attached to xs->skb for the app to retry. The skb now has
1) destructor_arg = a real xsk_addrs pointer,
2) num_descs = 1
If the app never retries and just close()s the socket, xsk_release()
calls xsk_drop_skb() -> xsk_consume_skb(), which decides whether to
free xsk_addrs by testing num_descs > 1:
if (unlikely(num_descs > 1))
kmem_cache_free(xsk_tx_generic_cache, destructor_arg);
Because num_descs is exactly 1 the branch is skipped and the
xsk_addrs object is leaked to the xsk_tx_generic_cache slab.
Fix it by directly testing whether destructor_arg still holds an
inline address. If it does not, it has been repointed at memory
allocated from xsk_tx_generic_cache, which must be freed regardless
of whether num_descs was ever incremented.
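The tagged-pointer discrimination can be sketched in userspace (toy
model only; the helper names are invented stand-ins for the kernel
ones):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code: destructor_arg either carries an inline
 * descriptor address tagged with bit 0, or a slab pointer from
 * kmem_cache_zalloc(), which is word-aligned and thus has bit 0
 * clear. Testing the tag frees a promoted xsk_addrs even when
 * num_descs is still 1. */
void *tag_inline_addr(uintptr_t addr)
{
	return (void *)(addr | 0x1UL);
}

bool destructor_is_addr(const void *destructor_arg)
{
	return ((uintptr_t)destructor_arg & 0x1UL) != 0;
}
```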
Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5e6326e076ab..ed96f6ec8ff2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -605,7 +605,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
spin_lock_irqsave(&pool->cq_prod_lock, flags);
idx = xskq_get_prod(pool->cq);
- if (unlikely(num_descs > 1)) {
+ if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
for (i = 0; i < num_descs; i++) {
@@ -660,7 +660,7 @@ static void xsk_consume_skb(struct sk_buff *skb)
u32 num_descs = xsk_get_num_desc(skb);
struct xsk_addrs *xsk_addr;
- if (unlikely(num_descs > 1)) {
+ if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
}
--
2.41.3
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (6 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-04 14:59 ` Stanislav Fomichev
2026-05-05 15:44 ` [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Alexander Lobakin
2026-05-06 2:40 ` patchwork-bot+netdevbpf
9 siblings, 2 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
uintptr_t cast:
skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
mode the chunk offset is encoded in bits 48-63 of the descriptor
address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
lost entirely. The completion queue then returns a truncated address to
userspace, making buffer recycling impossible.
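The truncation can be demonstrated in userspace by simulating a 32-bit
uintptr_t with uint32_t (toy model only; TOY_OFFSET_SHIFT and the
helper name are invented, but the shift value matches
XSK_UNALIGNED_BUF_OFFSET_SHIFT):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not kernel code: uint32_t stands in for a 32-bit
 * uintptr_t. The chunk offset of an unaligned-mode descriptor lives
 * at bit 48 and above, beyond what 32 bits can hold. */
#define TOY_OFFSET_SHIFT 48

uint64_t addr_via_32bit_cast(uint64_t addr)
{
	uint32_t stored = (uint32_t)(addr | 0x1U);	/* the buggy cast */
	return (uint64_t)(stored & ~0x1U);		/* what the CQ reports */
}
```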
Fix this by handling the 32-bit case directly in
xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an
xsk_addrs struct (the same path already used for multi-descriptor
SKBs) to store the full u64 address. The existing tagged-pointer logic
in xsk_skb_destructor_is_addr() stays unchanged: slab pointers returned
from kmem_cache_zalloc() are always word-aligned and therefore have
bit 0 clear, which correctly identifies them as a struct pointer
rather than an inline tagged address on every architecture.
Factor the shared kmem_cache_zalloc + destructor_arg assignment into
__xsk_addrs_alloc() and add a wrapper xsk_addrs_alloc() that handles
the inline-to-list upgrade (is_addr check + get_addr + num_descs = 1).
The three former open-coded kmem_cache_zalloc call sites now reduce to
a single call each.
Propagate the -ENOMEM from xsk_skb_destructor_set_addr() through
xsk_skb_init_misc() so the caller can clean up the skb via kfree_skb()
before skb->destructor is installed.
The overhead is one extra kmem_cache_zalloc per first descriptor on
32-bit only; 64-bit builds are completely unchanged.
Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 88 ++++++++++++++++++++++++++++++++-------------------
1 file changed, 56 insertions(+), 32 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ed96f6ec8ff2..6bcd77068e52 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -566,9 +566,42 @@ static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
return (u64)((uintptr_t)skb_shinfo(skb)->destructor_arg & ~0x1UL);
}
-static void xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
+static struct xsk_addrs *__xsk_addrs_alloc(struct sk_buff *skb, u64 addr)
{
- skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
+ struct xsk_addrs *xsk_addr;
+
+ xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
+ if (unlikely(!xsk_addr))
+ return NULL;
+
+ xsk_addr->addrs[0] = addr;
+ skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
+ return xsk_addr;
+}
+
+static struct xsk_addrs *xsk_addrs_alloc(struct sk_buff *skb)
+{
+ struct xsk_addrs *xsk_addr;
+
+ if (!xsk_skb_destructor_is_addr(skb))
+ return (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+
+ xsk_addr = __xsk_addrs_alloc(skb, xsk_skb_destructor_get_addr(skb));
+ if (likely(xsk_addr))
+ xsk_addr->num_descs = 1;
+ return xsk_addr;
+}
+
+static int xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
+{
+ if (IS_ENABLED(CONFIG_64BIT)) {
+ skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
+ return 0;
+ }
+
+ if (unlikely(!__xsk_addrs_alloc(skb, addr)))
+ return -ENOMEM;
+ return 0;
}
static void xsk_inc_num_desc(struct sk_buff *skb)
@@ -644,14 +677,20 @@ void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
}
-static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
- u64 addr)
+static int xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
+ u64 addr)
{
+ int err;
+
+ err = xsk_skb_destructor_set_addr(skb, addr);
+ if (unlikely(err))
+ return err;
+
skb->dev = xs->dev;
skb->priority = READ_ONCE(xs->sk.sk_priority);
skb->mark = READ_ONCE(xs->sk.sk_mark);
skb->destructor = xsk_destruct_skb;
- xsk_skb_destructor_set_addr(skb, addr);
+ return 0;
}
static void xsk_consume_skb(struct sk_buff *skb)
@@ -749,18 +788,9 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
} else {
struct xsk_addrs *xsk_addr;
- if (xsk_skb_destructor_is_addr(skb)) {
- xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
- GFP_KERNEL);
- if (!xsk_addr)
- return ERR_PTR(-ENOMEM);
-
- xsk_addr->num_descs = 1;
- xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
- skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
- } else {
- xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
- }
+ xsk_addr = xsk_addrs_alloc(skb);
+ if (!xsk_addr)
+ return ERR_PTR(-ENOMEM);
/* in case of -EOVERFLOW that could happen below,
* xsk_consume_skb() will release this node as whole skb
@@ -849,19 +879,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
struct page *page;
u8 *vaddr;
- if (xsk_skb_destructor_is_addr(skb)) {
- xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
- GFP_KERNEL);
- if (!xsk_addr) {
- err = -ENOMEM;
- goto free_err;
- }
-
- xsk_addr->num_descs = 1;
- xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
- skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
- } else {
- xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+ xsk_addr = xsk_addrs_alloc(skb);
+ if (!xsk_addr) {
+ err = -ENOMEM;
+ goto free_err;
}
if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
@@ -886,8 +907,11 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
}
}
- if (!xs->skb)
- xsk_skb_init_misc(skb, xs, desc->addr);
+ if (!xs->skb) {
+ err = xsk_skb_init_misc(skb, xs, desc->addr);
+ if (unlikely(err))
+ goto free_err;
+ }
xsk_inc_num_desc(skb);
return skb;
--
2.41.3
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
2026-05-02 20:07 ` [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
@ 2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:18 ` Jason Xing
0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-03 20:09 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Unconditional rejection of `XDP_UMEM_TX_SW_CSUM` breaks valid native zero-copy configurations.
- [High] The root cause of the 'wedged TX ring' remains unaddressed, allowing trivial userspace triggers.
--
commit a1862dee263d88bc3b6ca0898f858a88e2e37fce
Author: Jason Xing <kerneljasonxing@gmail.com>
xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
This commit attempts to prevent silent per-packet failures and wedged TX
rings when using sw-csum with drivers advertising IFF_TX_SKB_NO_LINEAR.
It does so by rejecting the bind operation with -EOPNOTSUPP during setup.
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
> Without the patch, every descriptor carrying 'XDP_TX_METADATA |
> XDP_TXMD_FLAGS_CHECKSUM' produces:
> 1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
> 2) sendmsg() returning -EINVAL without consuming the descriptor
> (invalid_descs is not incremented),
> 3) a wedged TX ring: __xsk_generic_xmit() does not advance the
> consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
> the same descriptor and re-hits the same WARN until the socket
> is closed.
Does this completely resolve the wedged TX ring issue? While this patch
prevents skb_checksum_help() from returning -EINVAL, xsk_skb_metadata()
can still return -EINVAL for other reasons, such as an out-of-bounds
csum_start or csum_offset provided by userspace.
If xsk_build_skb() returns -EINVAL, wouldn't __xsk_generic_xmit() still
evaluate if (err != -EOVERFLOW) goto out; and break its loop without
consuming the descriptor? Might the generic transmit path need to be
updated to consume the descriptor and drop the packet on terminal errors
like -EINVAL to fully prevent userspace from wedging the ring?
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index cd7bc50872f6..d981cfdd8535 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -175,6 +175,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> if (force_zc && force_copy)
> return -EINVAL;
>
> + if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> + return -EOPNOTSUPP;
> +
Will this check unconditionally reject valid native zero-copy configurations?
In native zero-copy mode (when force_zc is set), the transmission path
bypasses __xsk_generic_xmit() entirely and processes descriptors natively,
meaning it does not use xsk_build_skb_zerocopy() or invoke
skb_checksum_help().
If a UMEM legitimately has the XDP_UMEM_TX_SW_CSUM flag set (for example,
when shared across multiple devices), it appears this will now return
-EOPNOTSUPP. Should this check be restricted to only apply when using the
generic datapath?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=1
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
2026-05-02 20:07 ` [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
@ 2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:26 ` Jason Xing
0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-03 20:09 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Critical] This problem wasn't introduced by this patch, but returning -EOVERFLOW when xs->skb is NULL leads to an unconditional NULL pointer dereference in the caller's error path.
- [High] Calling kfree_skb(skb) directly on a partially built SKB triggers the custom XSK destructor, which incorrectly submits the failed descriptor to the AF_XDP completion queue.
- [High] This problem wasn't introduced by this patch, but an identical SKB memory leak exists when xsk_skb_metadata() fails.
--
commit cf402952a5ccf9981b4749df15227be947716b0d
Author: Jason Xing <kerneljasonxing@gmail.com>
xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
This patch explicitly adds kfree_skb() before returning -EOVERFLOW to
its caller when hitting the MAX_SKB_FRAGS limit on a newly allocated
SKB. This prevents the SKB from being permanently leaked.
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 887abed25466..d706b1e0bf60 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -856,8 +856,11 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> addr = buffer - pool->addrs;
>
> for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) {
> - if (unlikely(i >= MAX_SKB_FRAGS))
> + if (unlikely(i >= MAX_SKB_FRAGS)) {
> + if (!xs->skb)
> + kfree_skb(skb);
Does calling kfree_skb() directly here incorrectly trigger the custom
XSK destructor for a dropped packet?
Earlier in xsk_build_skb_zerocopy(), the newly allocated SKB is initialized
via xsk_skb_init_misc(), which explicitly sets skb->destructor =
xsk_destruct_skb.
When xsk_destruct_skb() runs, it calls xsk_cq_submit_addr_locked(), which
erroneously generates a spurious TX completion for userspace and bypasses
the queue's cached_prod reservation state, corrupting the completion queue
logic.
Should the destructor be cleared first to safely discard the partially
built SKB?
Note that this regression was fixed later in the patch series by commit
88b75ff1d12b ("xsk: prevent CQ desync when freeing half-built skbs in
xsk_build_skb()").
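The safe-discard pattern asked about here — detach the destructor before freeing so no spurious completion is generated — can be modeled in isolation. All struct and function names below are a toy user-space simplification, not the real skb/xsk code:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of an skb whose destructor submits a TX completion. */
struct toy_skb {
	void (*destructor)(struct toy_skb *skb);
	int *completions;	/* models the CQ producer counter */
};

static void toy_destruct(struct toy_skb *skb)
{
	(*skb->completions)++;	/* models xsk_cq_submit_addr_locked() */
}

/* Models kfree_skb(): the destructor, if set, always runs. */
static void toy_kfree_skb(struct toy_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	free(skb);
}

/* Safe discard of a half-built skb: detach the destructor first so
 * the free cannot push a completion for a packet that never left. */
static void toy_discard(struct toy_skb *skb)
{
	skb->destructor = NULL;
	toy_kfree_skb(skb);
}
```

The series ultimately avoids the problem differently (by installing the destructor only after the skb is fully built), but the invariant is the same: a destructor that submits completions must never run for a dropped packet.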
This problem wasn't introduced by this patch, but can returning -EOVERFLOW
when xs->skb is NULL cause a NULL pointer dereference in the caller's error
path?
In the caller xsk_build_skb(), the local skb pointer is set to NULL, and
execution jumps to the free_err label. Because err == -EOVERFLOW, the error
handler unconditionally calls:
        xsk_inc_num_desc(xs->skb);
        xsk_drop_skb(xs->skb);
Since xs->skb is NULL, these functions immediately dereference the NULL
pointer via skb_shinfo(NULL) inside xsk_skb_destructor_is_addr(), causing
a panic.
Note that this regression was fixed later in the patch series by commit
819a57cd3753 ("xsk: handle NULL dereference of the skb without frags
issue").
This problem wasn't introduced by this patch either, but is there an
identical SKB memory leak earlier in this function when xsk_skb_metadata()
fails?
If desc->options & XDP_TX_METADATA is set and xsk_skb_metadata() fails,
the function returns ERR_PTR(err) directly without freeing the newly
allocated skb:
        xsk_skb_init_misc(skb, xs, desc->addr);
        if (desc->options & XDP_TX_METADATA) {
                err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
                if (unlikely(err))
                        return ERR_PTR(err);
        }
The caller xsk_build_skb() explicitly overrides its local skb pointer to
NULL before jumping to free_err, causing this SKB to be permanently leaked.
An unprivileged user could trigger this leak by providing malformed
metadata, leading to memory exhaustion.
Note that this regression was fixed later in the patch series by commit
f1079e939977 ("xsk: avoid skb leak in XDP_TX_METADATA case").
> return ERR_PTR(-EOVERFLOW);
> + }
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=2
* Re: [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue
2026-05-02 20:07 ` [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
@ 2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:28 ` Jason Xing
0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-03 20:09 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Completion Queue (CQ) state corruption when handling -EOVERFLOW for the first descriptor.
--
commit 819a57cd3753db9f503ac093f83dccc180da574d
Author: Jason Xing <kerneljasonxing@gmail.com>
xsk: handle NULL dereference of the skb without frags issue
This commit fixes a NULL pointer dereference when a first descriptor
triggers -EOVERFLOW in xsk_build_skb_zerocopy() by adding a check for
xs->skb. It handles the first-descriptor case by manually canceling the CQ
slot and incrementing invalid_descs.
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -976,9 +976,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> kfree_skb(skb);
>
> if (err == -EOVERFLOW) {
> - /* Drop the packet */
> - xsk_inc_num_desc(xs->skb);
> - xsk_drop_skb(xs->skb);
> + if (xs->skb) {
> + /* Drop the packet */
> + xsk_inc_num_desc(xs->skb);
> + xsk_drop_skb(xs->skb);
> + } else {
> + xsk_cq_cancel_locked(xs->pool, 1);
Does this cause a Completion Queue (CQ) state corruption?
When a first descriptor triggers -EOVERFLOW in xsk_build_skb_zerocopy(), the
free_err path first calls kfree_skb(skb) just above this code. Because
xsk_skb_init_misc() was called earlier, the SKB's destructor is set to
xsk_destruct_skb(). The kfree_skb() call invokes this destructor, which
submits a completion and increments the ring's producer index.
If we then call xsk_cq_cancel_locked() here, it decrements the reservation
index while the producer index was already incremented, breaking the core
invariant cached_prod >= ring->producer.
This causes space calculations to underflow and can overwrite unconsumed
entries in the CQ.
I see this was addressed later in the series by commit 88b75ff1d12b ("xsk:
prevent CQ desync when freeing half-built skbs in xsk_build_skb()").
> + xs->tx->invalid_descs++;
> + }
> xskq_cons_release(xs->tx);
> } else {
> /* Let application retry */
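The broken invariant described in the comment above can be modeled with two counters: reserving a slot bumps the cached producer, submitting publishes to the user-visible producer, and cancelling undoes a reservation. Submitting and then cancelling the same slot (what the buggy path did) violates `cached_prod >= producer`. Names are a simplification, not the real xskq code:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the CQ producer-side state. */
struct toy_cq {
	unsigned int cached_prod;	/* reserved, not yet published */
	unsigned int producer;		/* published to userspace      */
};

static void cq_reserve(struct toy_cq *q) { q->cached_prod++; }
static void cq_submit(struct toy_cq *q)  { q->producer++;    }
static void cq_cancel(struct toy_cq *q)  { q->cached_prod--; }

/* Core invariant: you can never have published more slots than
 * you have reserved. */
static bool cq_invariant_holds(const struct toy_cq *q)
{
	return q->cached_prod >= q->producer;
}
```

Once the invariant breaks, free-space calculations based on `cached_prod - producer` underflow, which is how unconsumed CQ entries end up overwritten.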
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=3
* Re: [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
2026-05-02 20:07 ` [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
@ 2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:32 ` Jason Xing
0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-03 20:09 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Memory leak of SKB in xsk_build_skb_zerocopy() when TX metadata processing fails.
- [High] Memory leak of xsk_addr array when an SKB fragment allocation fails and the packet is subsequently dropped.
- [High] Double completion and Completion Queue (CQ) state desynchronization when xsk_skb_metadata() fails.
--
commit 7c8b265d19cf8eaa2671c3f79683f55810dfc68a
Author: Jason Xing <kerneljasonxing@gmail.com>
xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
When xsk_build_skb() processes multi-buffer packets in copy mode, the
first descriptor stores data into the skb linear area without adding
any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
to accumulate subsequent descriptors.
Fix by using a !xs->skb check to handle the first-frag situation, ensuring
that only skbs freshly allocated in this call are freed.
Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 06ee260f3afc..55378c3855d5 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -972,7 +972,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> return skb;
>
> free_err:
> - if (skb && !skb_shinfo(skb)->nr_frags)
> + if (skb && !xs->skb)
Does this code leak the SKB if xsk_build_skb_zerocopy() allocations fail
during TX metadata processing?
If xsk_build_skb_zerocopy() allocates a new SKB using sock_alloc_send_skb()
but metadata parsing fails (e.g., xsk_skb_metadata() returns an error), it
returns ERR_PTR(err). The caller xsk_build_skb() explicitly overwrites its
local pointer with skb = NULL before jumping to free_err:
        skb = xsk_build_skb_zerocopy(xs, desc);
        if (IS_ERR(skb)) {
                err = PTR_ERR(skb);
                skb = NULL;
                goto free_err;
        }
Since skb is NULL, this check is bypassed, and the newly allocated SKB and
its associated socket write memory quota are leaked.
I understand this is fixed later in the patch series by commit "xsk: avoid
skb leak in XDP_TX_METADATA case".
Does this code leak the xsk_addr array when an SKB fragment allocation
fails and the packet is subsequently dropped?
When appending a fragment to an in-progress packet (xs->skb != NULL), the
code identifies an inline address tag and replaces it by allocating an
xsk_addr array via kmem_cache_zalloc(), initializing xsk_addr->num_descs
to 1.
If alloc_page() subsequently fails, the code returns -EAGAIN to retry later,
leaving xs->skb intact.
If the packet is dropped before a successful retry (e.g., socket close
triggering xsk_release() and xsk_drop_skb()), xsk_consume_skb() attempts to
clean up the array. However, xsk_consume_skb() uses the check
if (unlikely(num_descs > 1)) to determine if it should call
kmem_cache_free().
Since num_descs is exactly 1, the check evaluates to false, and the xsk_addrs
allocation is leaked.
I see this is fixed later in the patch series by commit "xsk: fix xsk_addrs
slab leak on multi-buffer error path".
> kfree_skb(skb);
Can this kfree_skb() call lead to a double completion and Completion Queue
(CQ) state desynchronization if xsk_skb_metadata() fails?
In xsk_build_skb() copy mode, xsk_skb_init_misc() sets
skb->destructor = xsk_destruct_skb. Immediately after, if xsk_skb_metadata()
fails, execution jumps to free_err and triggers this kfree_skb(skb).
This executes the SKB destructor, which unconditionally submits the failed
descriptor to the Completion Queue by advancing the user-visible
ring->producer pointer.
Immediately following the free, the free_err handler calls
xsk_cq_cancel_locked(), which decrements the internally cached cached_prod
reservation counter. This desynchronizes the internal reservation counter from
the user-visible ring pointer and falsely pushes a completion event to user
space for a descriptor that failed transmission.
I note this is fixed later in the patch series by commit "xsk: prevent CQ
desync when freeing half-built skbs in xsk_build_skb()".
>
> if (err == -EOVERFLOW) {
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=4
* Re: [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
2026-05-02 20:07 ` [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
@ 2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:36 ` Jason Xing
0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-03 20:09 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] In `xsk_build_skb_zerocopy()`, if `xsk_skb_metadata()` fails, the newly allocated `skb` is leaked.
--
commit 88b75ff1d12b482a55e1724cc457909cf3ea19dd
Author: Jason Xing <kerneljasonxing@gmail.com>
xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
Postpone the skb initialization phase in xsk_build_skb() until the first
frag allocation is completed to prevent the skb destructor from running
on error paths. Before the initialization, the skb can be safely freed
by kfree_skb().
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 55378c3855d5..af3c5752bb63 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -819,8 +819,6 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> return ERR_PTR(err);
>
> skb_reserve(skb, hr);
> -
> - xsk_skb_init_misc(skb, xs, desc->addr);
> if (desc->options & XDP_TX_METADATA) {
> err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
> if (unlikely(err))
Could this leak the newly allocated skb if xsk_skb_metadata() fails?
If xsk_skb_metadata() returns an error, xsk_build_skb_zerocopy() will return
ERR_PTR(err) directly without freeing the skb.
The caller receives the error pointer, sets its local skb variable to NULL,
and jumps to the free_err label:
net/xdp/xsk.c:xsk_build_skb() {
        ...
        if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
                skb = xsk_build_skb_zerocopy(xs, desc);
                if (IS_ERR(skb)) {
                        err = PTR_ERR(skb);
                        skb = NULL;
                        goto free_err;
                }
                ...
        }
Because the skb is explicitly nullified here, the kfree_skb(skb) call in
the cleanup path is bypassed, leaving the socket buffer and its associated
sk_wmem_alloc charge leaked.
While this pre-existing issue is explicitly fixed later in this patch series
in commit "xsk: avoid skb leak in XDP_TX_METADATA case", should it be
addressed earlier to ensure intermediate states remain safe during a bisect?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=5
* Re: [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case
2026-05-02 20:07 ` [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
@ 2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:43 ` Jason Xing
0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-03 20:09 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Time-of-Check to Time-of-Use (TOCTOU) vulnerability on shared UMEM metadata allows bypassing checksum bounds validation.
- [High] Leaked `xsk_addrs` structure in the non-zerocopy `xsk_build_skb()` error path when page allocation fails.
--
commit f1079e9399771f267ff5b6767db05016e5ee4840
Author: Jason Xing <kerneljasonxing@gmail.com>
xsk: avoid skb leak in XDP_TX_METADATA case
This commit fixes an skb leak in the XDP_TX_METADATA case by explicitly
adding kfree_skb() before returning to the caller. The issue happens when
xsk_skb_metadata() returns an error, leading the caller to clear the skb
pointer without freeing the skb.
Closes: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index af3c5752bb63..770ba4695a9d 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -821,8 +821,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> skb_reserve(skb, hr);
> if (desc->options & XDP_TX_METADATA) {
> err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
This wasn't introduced by this patch, but is it possible for a shared UMEM
metadata TOCTOU race to bypass the checksum bounds validation?
In xsk_skb_metadata(), meta points to the metadata area within the umem buffer
(buffer - pool->tx_metadata_len), which is memory-mapped and concurrently
writable by userspace. The function performs a bounds check:
        if (unlikely(meta->request.csum_start +
                     meta->request.csum_offset +
                     sizeof(__sum16) > desc->len))
                return -EINVAL;
If this passes, it re-reads the values directly from the shared memory:
        skb->csum_start = hr + meta->request.csum_start;
        skb->csum_offset = meta->request.csum_offset;
Could a malicious user application race to overwrite these values with
arbitrarily large 16-bit integers between the validation check and the
assignment, leading to out-of-bounds DMA accesses? Might we need to copy
the metadata to a local variable (e.g., using READ_ONCE()) before validation?
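The defensive pattern being suggested — snapshot the shared fields once, then validate and apply the same snapshot — looks roughly like this. Struct and field names are hypothetical stand-ins for the UMEM metadata; the kernel would use READ_ONCE() for the loads:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct csum_req {	/* lives in user-writable shared memory */
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Copy the shared fields to locals first, then validate and apply
 * the same snapshot. A concurrent writer can no longer change the
 * values between the check and the use. */
static bool apply_csum_meta(const volatile struct csum_req *shared,
			    uint32_t desc_len,
			    uint16_t *out_start, uint16_t *out_offset)
{
	uint16_t start  = shared->csum_start;	/* READ_ONCE() in-kernel */
	uint16_t offset = shared->csum_offset;

	if ((uint32_t)start + offset + sizeof(uint16_t) > desc_len)
		return false;

	*out_start  = start;	/* use the validated snapshot, not */
	*out_offset = offset;	/* a fresh read of shared memory   */
	return true;
}
```

The widening to `uint32_t` before the addition also avoids the check itself overflowing on large 16-bit inputs.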
> - if (unlikely(err))
> + if (unlikely(err)) {
> + kfree_skb(skb);
> return ERR_PTR(err);
> + }
> }
> } else {
> struct xsk_addrs *xsk_addr;
This wasn't introduced by this patch, but does the non-zerocopy
xsk_build_skb() leak the xsk_addrs structure on its error path?
When appending a fragment to an existing SKB (xs->skb != NULL) in
xsk_build_skb(), the code converts the destructor_arg from a direct address
to an allocated xsk_addrs structure and initializes xsk_addr->num_descs = 1.
If the subsequent alloc_page() fails, it jumps to free_err and returns
-EAGAIN, leaving the partially built SKB in xs->skb without incrementing
num_descs.
When the socket is closed or the SKB is subsequently dropped via
xsk_drop_skb(), xsk_consume_skb() is called. Because destructor_arg is now a
valid pointer, it fetches num_descs as 1. The cleanup logic relies on:
if (unlikely(num_descs > 1))
to free the xsk_addrs structure. Since num_descs is exactly 1, does it skip
the free, permanently leaking the xsk_tx_generic_cache object?
I saw this issue is addressed later in the patch series (xsk: fix xsk_addrs
slab leak on multi-buffer error path).
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=6
* Re: [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-05-02 20:07 ` [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
@ 2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:46 ` Jason Xing
2026-05-04 14:59 ` Stanislav Fomichev
1 sibling, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-03 20:09 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] In the copy-mode TX path (`xsk_build_skb`), if a multi-buffer fragment exceeds the `MAX_SKB_FRAGS` limit, the buffer's address is leaked to userspace because `0` is erroneously submitted to the completion queue.
--
commit 4a776aefc60aa4d3123e7e63fd8f927f3748d008
Author: Jason Xing <kerneljasonxing@gmail.com>
xsk: fix u64 descriptor address truncation on 32-bit architectures
This commit fixes a 32-bit truncation issue when storing 64-bit descriptor
addresses in the SKB destructor argument by allocating an xsk_addrs struct.
It consolidates the allocation logic into xsk_addrs_alloc() and uses it in
both zerocopy and copy modes.
Link: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 079abd4bcb69..5e5786cd9af5 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
[ ... ]
> @@ -929,19 +959,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> struct page *page;
> u8 *vaddr;
>
> - if (xsk_skb_destructor_is_addr(skb)) {
> - xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
> - GFP_KERNEL);
> - if (!xsk_addr) {
> - err = -ENOMEM;
> - goto free_err;
> - }
> -
> - xsk_addr->num_descs = 1;
> - xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
> - skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
> - } else {
> - xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> + xsk_addr = xsk_addrs_alloc(skb);
> + if (!xsk_addr) {
> + err = -ENOMEM;
> + goto free_err;
> }
>
> if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
This isn't a regression introduced by this patch, but while looking at the
modified code, does this error handling path leak the buffer's address to
userspace?
If this overflow check evaluates to true, the code sets err = -EOVERFLOW and
jumps to the free_err label before assigning desc->addr to
xsk_addr->addrs[xsk_addr->num_descs].
In the free_err block, it handles -EOVERFLOW by calling
xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb). This increments
num_descs to account for the dropped descriptor and eventually calls
xsk_destruct_skb() which calls xsk_cq_submit_addr_locked().
Since xsk_addr->addrs was never initialized with desc->addr for this
descriptor, does this mean it submits 0 (from kmem_cache_zalloc) to the
completion queue instead of the actual buffer address, causing userspace
to permanently lose track of the buffer?
By comparison, xsk_build_skb_zerocopy() assigns desc->addr to the array
before checking the fragment limit:
net/xdp/xsk.c:xsk_build_skb_zerocopy() {
        ...
        xsk_addr = xsk_addrs_alloc(skb);
        if (!xsk_addr)
                return ERR_PTR(-ENOMEM);

        /* in case of -EOVERFLOW that could happen below,
         * xsk_consume_skb() will release this node as whole skb
         * would be dropped, which implies freeing all list elements
         */
        xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
        }

        len = desc->len;
        ...
}
Should xsk_build_skb() also record the descriptor address before checking
the -EOVERFLOW condition?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=8
* Re: [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-05-02 20:07 ` [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
2026-05-03 20:09 ` sashiko-bot
@ 2026-05-04 14:59 ` Stanislav Fomichev
1 sibling, 0 replies; 27+ messages in thread
From: Stanislav Fomichev @ 2026-05-04 14:59 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
On 05/02, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
> descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
> uintptr_t cast:
>
> skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
>
> On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
> the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
> mode the chunk offset is encoded in bits 48-63 of the descriptor
> address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
> lost entirely. The completion queue then returns a truncated address to
> userspace, making buffer recycling impossible.
>
> Fix this by handling the 32-bit case directly in
> xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an
> xsk_addrs struct (the same path already used for multi-descriptor
> SKBs) to store the full u64 address. The existing tagged-pointer logic
> in xsk_skb_destructor_is_addr() stays unchanged: slab pointers returned
> from kmem_cache_zalloc() are always word-aligned and therefore have
> bit 0 clear, which correctly identifies them as a struct pointer
> rather than an inline tagged address on every architecture.
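The bit-0 discrimination described in the quoted paragraph can be demonstrated in isolation: heap/slab pointers are word-aligned, so bit 0 is free to tag an inline value. A user-space sketch (these helpers are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Tag an inline address by setting bit 0; word-aligned struct
 * pointers always have bit 0 clear, so the low bit alone tells the
 * two encodings apart. */
static void *tag_inline_addr(uintptr_t addr)
{
	return (void *)(addr | 0x1UL);
}

static bool is_inline_addr(const void *arg)
{
	return ((uintptr_t)arg & 0x1UL) != 0;
}

static uintptr_t get_inline_addr(const void *arg)
{
	return (uintptr_t)arg & ~0x1UL;	/* strip the tag bit */
}
```

The scheme only works because the tagged values (descriptor addresses) can spare bit 0 and the untagged values (allocator-returned pointers) are guaranteed aligned — the exact property the commit message relies on for kmem_cache_zalloc().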
>
> Factor the shared kmem_cache_zalloc + destructor_arg assignment into
> __xsk_addrs_alloc() and add a wrapper xsk_addrs_alloc() that handles
> the inline-to-list upgrade (is_addr check + get_addr + num_descs = 1).
> The three former open-coded kmem_cache_zalloc call sites now reduce to
> a single call each.
>
> Propagate the -ENOMEM from xsk_skb_destructor_set_addr() through
> xsk_skb_init_misc() so the caller can clean up the skb via kfree_skb()
> before skb->destructor is installed.
>
> The overhead is one extra kmem_cache_zalloc per first descriptor on
> 32-bit only; 64-bit builds are completely unchanged.
>
> Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
> Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
LGTM, thanks!
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
* Re: [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (7 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
@ 2026-05-05 15:44 ` Alexander Lobakin
2026-05-05 19:09 ` Jason Xing
2026-05-06 2:40 ` patchwork-bot+netdevbpf
9 siblings, 1 reply; 27+ messages in thread
From: Alexander Lobakin @ 2026-05-05 15:44 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Sat, 2 May 2026 23:07:14 +0300
> From: Jason Xing <kernelxing@tencent.com>
>
> There are rare issues around xsk_build_skb(). Some of them
> were found by Sashiko[1][2].
For the series:
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Just don't forget to look at Sashiko's replies; I haven't looked deeply
into them, so I can't say whether anything there makes sense.
>
> [1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
> [2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/
>
> ---
> v5
> Link: https://lore.kernel.org/all/20260424053816.27965-1-kerneljasonxing@gmail.com/
> 1. reword patch 3
> 2. adjust the order of patch 6, which is now moved up to be patch 2
> 3. use helper in patch (Stan)
>
> v4
> Link: https://lore.kernel.org/all/20260422033650.68457-1-kerneljasonxing@gmail.com/
> 1. fix 32-bit arch issue in patch 8 (Alexander && Maciej)
> 2. add acked-by tags (Stan)
>
> v3
> Link: https://lore.kernel.org/all/20260420082805.14844-1-kerneljasonxing@gmail.com/
> 1. use !xs->skb as the indicator of first frag in patch 3 (Stan)
> 2. move init skb after it's allocated successfully, so patch 4,5,6 have
> been adjusted accordingly (Stan)
> 3. do not support 32-bit arch (Stan)
> 4. add acked-by tags (Stan)
>
> v2
> Link: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/#t
> 1. add four patches spotted by sashiko to fix buggy pre-existing
> behavior
> 2. adjust the order of 8 patches.
>
>
>
> Jason Xing (8):
> xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
> xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
> xsk: handle NULL dereference of the skb without frags issue
> xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
> xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
> xsk: avoid skb leak in XDP_TX_METADATA case
> xsk: fix xsk_addrs slab leak on multi-buffer error path
> xsk: fix u64 descriptor address truncation on 32-bit architectures
>
> net/xdp/xsk.c | 115 ++++++++++++++++++++++++++--------------
> net/xdp/xsk_buff_pool.c | 3 ++
> 2 files changed, 77 insertions(+), 41 deletions(-)
Thanks,
Olek
* Re: [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation
2026-05-05 15:44 ` [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Alexander Lobakin
@ 2026-05-05 19:09 ` Jason Xing
0 siblings, 0 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-05 19:09 UTC (permalink / raw)
To: Alexander Lobakin
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
On Tue, May 5, 2026 at 6:45 PM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Sat, 2 May 2026 23:07:14 +0300
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > There are rare issues around xsk_build_skb(). Some of them
> > were found by Sashiko[1][2].
>
> For the series:
>
> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>
> Just don't forget to look at Sashiko's replies; I haven't looked deeply
> into them, so I can't say whether anything there makes sense.
Thanks for the review. I'm currently away from the keyboard because of
a long vacation. I'll handle them one by one over the next two days.
Hopefully I can manage it today :P
Thanks,
Jason
>
> >
> > [1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
> > [2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/
> >
> > ---
> > v5
> > Link: https://lore.kernel.org/all/20260424053816.27965-1-kerneljasonxing@gmail.com/
> > 1. reword patch 3
> > 2. adjust the order of patch 6, which is now moved up to be patch 2
> > 3. use helper in patch (Stan)
> >
> > v4
> > Link: https://lore.kernel.org/all/20260422033650.68457-1-kerneljasonxing@gmail.com/
> > 1. fix 32-bit arch issue in patch 8 (Alexander && Maciej)
> > 2. add acked-by tags (Stan)
> >
> > v3
> > Link: https://lore.kernel.org/all/20260420082805.14844-1-kerneljasonxing@gmail.com/
> > 1. use !xs->skb as the indicator of first frag in patch 3 (Stan)
> > 2. move init skb after it's allocated successfully, so patch 4,5,6 have
> > been adjusted accordingly (Stan)
> > 3. do not support 32-bit arch (Stan)
> > 4. add acked-by tags (Stan)
> >
> > v2
> > Link: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/#t
> > 1. add four patches spotted by sashiko to fix buggy pre-existing
> > behavior
> > 2. adjust the order of 8 patches.
> >
> >
> >
> > Jason Xing (8):
> > xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
> > xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
> > xsk: handle NULL dereference of the skb without frags issue
> > xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
> > xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
> > xsk: avoid skb leak in XDP_TX_METADATA case
> > xsk: fix xsk_addrs slab leak on multi-buffer error path
> > xsk: fix u64 descriptor address truncation on 32-bit architectures
> >
> > net/xdp/xsk.c | 115 ++++++++++++++++++++++++++--------------
> > net/xdp/xsk_buff_pool.c | 3 ++
> > 2 files changed, 77 insertions(+), 41 deletions(-)
>
> Thanks,
> Olek
* Re: [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
2026-05-03 20:09 ` sashiko-bot
@ 2026-05-05 19:18 ` Jason Xing
0 siblings, 0 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-05 19:18 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Sun, May 3, 2026 at 11:09 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [High] Unconditional rejection of `XDP_UMEM_TX_SW_CSUM` breaks valid native zero-copy configurations.
> - [High] The root cause of the 'wedged TX ring' remains unaddressed, allowing trivial userspace triggers.
> --
>
> commit a1862dee263d88bc3b6ca0898f858a88e2e37fce
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
>
> This commit attempts to prevent silent per-packet failures and wedged TX
> rings when using sw-csum with drivers advertising IFF_TX_SKB_NO_LINEAR.
> It does so by rejecting the bind operation with -EOPNOTSUPP during setup.
>
> Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
> Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
>
> > Without the patch, every descriptor carrying 'XDP_TX_METADATA |
> > XDP_TXMD_FLAGS_CHECKSUM' produces:
> > 1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
> > 2) sendmsg() returning -EINVAL without consuming the descriptor
> > (invalid_descs is not incremented),
> > 3) a wedged TX ring: __xsk_generic_xmit() does not advance the
> > consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
> > the same descriptor and re-hits the same WARN until the socket
> > is closed.
>
> Does this completely resolve the wedged TX ring issue? While this patch
> prevents skb_checksum_help() from returning -EINVAL, xsk_skb_metadata()
> can still return -EINVAL for other reasons, such as an out-of-bounds
> csum_start or csum_offset provided by userspace.
>
> If xsk_build_skb() returns -EINVAL, wouldn't __xsk_generic_xmit() still
> evaluate if (err != -EOVERFLOW) goto out; and break its loop without
> consuming the descriptor? Might the generic transmit path need to be
> updated to consume the descriptor and drop the packet on terminal errors
> like -EINVAL to fully prevent userspace from wedging the ring?
I believe we've 'discussed' this before. This is a separate, pre-existing
bug unrelated to this one. A few days ago, thanks to your reminder, it
was put on my to-do list.
>
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index cd7bc50872f6..d981cfdd8535 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -175,6 +175,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > if (force_zc && force_copy)
> > return -EINVAL;
> >
> > + if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> > + return -EOPNOTSUPP;
> > +
>
> Will this check unconditionally reject valid native zero-copy configurations?
>
> In native zero-copy mode (when force_zc is set), the transmission path
> bypasses __xsk_generic_xmit() entirely and processes descriptors natively,
> meaning it does not use xsk_build_skb_zerocopy() or invoke
> skb_checksum_help().
>
> If a UMEM legitimately has the XDP_UMEM_TX_SW_CSUM flag set (for example,
> when shared across multiple devices), it appears this will now return
> -EOPNOTSUPP. Should this check be restricted to only apply when using the
> generic datapath?
The same topic we 'discussed' before. No one is using the software
checksum feature in zerocopy mode. People will be warned about the
wrong usage and can then adjust their code accordingly.
Thanks,
Jason
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=1
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
2026-05-03 20:09 ` sashiko-bot
@ 2026-05-05 19:26 ` Jason Xing
0 siblings, 0 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-05 19:26 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Sun, May 3, 2026 at 11:09 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [Critical] This problem wasn't introduced by this patch, but returning -EOVERFLOW when xs->skb is NULL leads to an unconditional NULL pointer dereference in the caller's error path.
> - [High] Calling kfree_skb(skb) directly on a partially built SKB triggers the custom XSK destructor, which incorrectly submits the failed descriptor to the AF_XDP completion queue.
> - [High] This problem wasn't introduced by this patch, but an identical SKB memory leak exists when xsk_skb_metadata() fails.
> --
>
> commit cf402952a5ccf9981b4749df15227be947716b0d
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
>
> This patch explicitly adds kfree_skb() before returning -EOVERFLOW to
> its caller when hitting the MAX_SKB_FRAGS limit on a newly allocated
> SKB. This prevents the SKB from being permanently leaked.
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 887abed25466..d706b1e0bf60 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -856,8 +856,11 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > addr = buffer - pool->addrs;
> >
> > for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) {
> > - if (unlikely(i >= MAX_SKB_FRAGS))
> > + if (unlikely(i >= MAX_SKB_FRAGS)) {
> > + if (!xs->skb)
> > + kfree_skb(skb);
>
> Does calling kfree_skb() directly here incorrectly trigger the custom
> XSK destructor for a dropped packet?
>
> Earlier in xsk_build_skb_zerocopy(), the newly allocated SKB is initialized
> via xsk_skb_init_misc(), which explicitly sets skb->destructor =
> xsk_destruct_skb.
>
> When xsk_destruct_skb() runs, it calls xsk_cq_submit_addr_locked(), which
> erroneously generates a spurious TX completion for userspace and bypasses
> the queue's cached_prod reservation state, corrupting the completion queue
> logic.
>
> Should the destructor be cleared first to safely discard the partially
> built SKB?
>
> Note that this regression was fixed later in the patch series by commit
> 88b75ff1d12b ("xsk: prevent CQ desync when freeing half-built skbs in
> xsk_build_skb()").
>
> This problem wasn't introduced by this patch, but can returning -EOVERFLOW
> when xs->skb is NULL cause a NULL pointer dereference in the caller's error
> path?
The order of this series is very deliberately designed. Without this
patch the skb is unconditionally leaked (permanent memory loss), which
is strictly worse than a single spurious CQ completion that userspace
can handle. Patch 5/8 then eliminates the destructor issue by
postponing xsk_skb_init_misc() past all error paths.
This is why I moved this patch up in this version.
>
> In the caller xsk_build_skb(), the local skb pointer is set to NULL, and
> execution jumps to the free_err label. Because err == -EOVERFLOW, the error
> handler unconditionally calls:
>
> xsk_inc_num_desc(xs->skb);
> xsk_drop_skb(xs->skb);
>
> Since xs->skb is strictly NULL, these functions will immediately dereference
> the NULL pointer via skb_shinfo(NULL) inside xsk_skb_destructor_is_addr(),
> causing a panic.
>
> Note that this regression was fixed later in the patch series by commit
> 819a57cd3753 ("xsk: handle NULL dereference of the skb without frags
> issue").
>
> This problem wasn't introduced by this patch either, but is there an
> identical SKB memory leak earlier in this function when xsk_skb_metadata()
> fails?
As you noticed, this patch doesn't introduce any new bugs, and the
following patches take care of the pre-existing ones.
>
> If desc->options & XDP_TX_METADATA is set and xsk_skb_metadata() fails,
> the function returns ERR_PTR(err) directly without freeing the newly
> allocated skb:
>
> xsk_skb_init_misc(skb, xs, desc->addr);
> if (desc->options & XDP_TX_METADATA) {
> err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
> if (unlikely(err))
> return ERR_PTR(err);
> }
>
> The caller xsk_build_skb() explicitly overrides its local skb pointer to
> NULL before jumping to free_err, causing this SKB to be permanently leaked.
>
> An unprivileged user could trigger this leak by providing malformed
> metadata, leading to memory exhaustion.
>
> Note that this regression was fixed later in the patch series by commit
> f1079e939977 ("xsk: avoid skb leak in XDP_TX_METADATA case").
Right, it has been fixed...
No further action needs to be taken for now.
Thanks,
Jason
>
> > return ERR_PTR(-EOVERFLOW);
> > + }
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=2
* Re: [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue
2026-05-03 20:09 ` sashiko-bot
@ 2026-05-05 19:28 ` Jason Xing
0 siblings, 0 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-05 19:28 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Sun, May 3, 2026 at 11:09 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] Completion Queue (CQ) state corruption when handling -EOVERFLOW for the first descriptor.
> --
>
> commit 819a57cd3753db9f503ac093f83dccc180da574d
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: handle NULL dereference of the skb without frags issue
>
> This commit fixes a NULL pointer dereference when a first descriptor
> triggers -EOVERFLOW in xsk_build_skb_zerocopy() by adding a check for
> xs->skb. It handles the first-descriptor case by manually canceling the CQ
> slot and incrementing invalid_descs.
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -976,9 +976,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > kfree_skb(skb);
> >
> > if (err == -EOVERFLOW) {
> > - /* Drop the packet */
> > - xsk_inc_num_desc(xs->skb);
> > - xsk_drop_skb(xs->skb);
> > + if (xs->skb) {
> > + /* Drop the packet */
> > + xsk_inc_num_desc(xs->skb);
> > + xsk_drop_skb(xs->skb);
> > + } else {
> > + xsk_cq_cancel_locked(xs->pool, 1);
>
> Does this cause a Completion Queue (CQ) state corruption?
>
> When a first descriptor triggers -EOVERFLOW in xsk_build_skb_zerocopy(), the
> free_err path first calls kfree_skb(skb) just above this code. Because
> xsk_skb_init_misc() was called earlier, the SKB's destructor is set to
> xsk_destruct_skb(). The kfree_skb() call invokes this destructor, which
> submits a completion and increments the ring's producer index.
>
> If we then call xsk_cq_cancel_locked() here, it decrements the reservation
> index while the producer index was already incremented, breaking the core
> invariant cached_prod >= ring->producer.
>
> This causes space calculations to underflow and can overwrite unconsumed
> entries in the CQ.
>
> I see this was addressed later in the series by commit 88b75ff1d12b ("xsk:
> prevent CQ desync when freeing half-built skbs in xsk_build_skb()").
Yep, it has been resolved in the series.
Thanks,
Jason
>
> > + xs->tx->invalid_descs++;
> > + }
> > xskq_cons_release(xs->tx);
> > } else {
> > /* Let application retry */
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=3
* Re: [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
2026-05-03 20:09 ` sashiko-bot
@ 2026-05-05 19:32 ` Jason Xing
0 siblings, 0 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-05 19:32 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Sun, May 3, 2026 at 11:09 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [High] Memory leak of SKB in xsk_build_skb_zerocopy() when TX metadata processing fails.
> - [High] Memory leak of xsk_addr array when an SKB fragment allocation fails and the packet is subsequently dropped.
> - [High] Double completion and Completion Queue (CQ) state desynchronization when xsk_skb_metadata() fails.
> --
>
> commit 7c8b265d19cf8eaa2671c3f79683f55810dfc68a
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
>
> When xsk_build_skb() processes multi-buffer packets in copy mode, the
> first descriptor stores data into the skb linear area without adding
> any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
> to accumulate subsequent descriptors.
>
> Fix by using a !xs->skb check to handle first frag situation, ensuring
> we only free skbs that were freshly allocated in this call.
>
> Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 06ee260f3afc..55378c3855d5 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -972,7 +972,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > return skb;
> >
> > free_err:
> > - if (skb && !skb_shinfo(skb)->nr_frags)
> > + if (skb && !xs->skb)
>
> Does this code leak the SKB if xsk_build_skb_zerocopy() allocations fail
> during TX metadata processing?
>
> If xsk_build_skb_zerocopy() allocates a new SKB using sock_alloc_send_skb()
> but metadata parsing fails (e.g., xsk_skb_metadata() returns an error), it
> returns ERR_PTR(err). The caller xsk_build_skb() explicitly overwrites its
> local pointer with skb = NULL before jumping to free_err:
>
> skb = xsk_build_skb_zerocopy(xs, desc);
> if (IS_ERR(skb)) {
> err = PTR_ERR(skb);
> skb = NULL;
> goto free_err;
> }
>
> Since skb is NULL, this check is bypassed, and the newly allocated SKB and
> its associated socket write memory quota are leaked.
>
> I understand this is fixed later in the patch series by commit "xsk: avoid
> skb leak in XDP_TX_METADATA case".
>
> Does this code leak the xsk_addr array when an SKB fragment allocation
> fails and the packet is subsequently dropped?
>
> When appending a fragment to an in-progress packet (xs->skb != NULL), the
> code identifies an inline address tag and replaces it by allocating an
> xsk_addr array via kmem_cache_zalloc(), initializing xsk_addr->num_descs
> to 1.
>
> If alloc_page() subsequently fails, the code returns -EAGAIN to retry later,
> leaving xs->skb intact.
>
> If the packet is dropped before a successful retry (e.g., socket close
> triggering xsk_release() and xsk_drop_skb()), xsk_consume_skb() attempts to
> clean up the array. However, xsk_consume_skb() uses the check
> if (unlikely(num_descs > 1)) to determine if it should call
> kmem_cache_free().
>
> Since num_descs is exactly 1, the check evaluates to false, and the xsk_addrs
> allocation is leaked.
>
> I see this is fixed later in the patch series by commit "xsk: fix xsk_addrs
> slab leak on multi-buffer error path".
>
> > kfree_skb(skb);
>
> Can this kfree_skb() call lead to a double completion and Completion Queue
> (CQ) state desynchronization if xsk_skb_metadata() fails?
>
> In xsk_build_skb() copy mode, xsk_skb_init_misc() sets
> skb->destructor = xsk_destruct_skb. Immediately after, if xsk_skb_metadata()
> fails, execution jumps to free_err and triggers this kfree_skb(skb).
>
> This executes the SKB destructor, which unconditionally submits the failed
> descriptor to the Completion Queue by advancing the user-visible
> ring->producer pointer.
>
> Immediately following the free, the free_err handler calls
> xsk_cq_cancel_locked(), which decrements the internally cached cached_prod
> reservation counter. This desynchronizes the internal reservation counter from
> the user-visible ring pointer and falsely pushes a completion event to user
> space for a descriptor that failed transmission.
>
> I note this is fixed later in the patch series by commit "xsk: prevent CQ
> desync when freeing half-built skbs in xsk_build_skb()".
Those three points you brought up are all handled in the series.
Again, the order of this series is designed deliberately to avoid
new/severe problems.
Thanks,
Jason
>
> >
> > if (err == -EOVERFLOW) {
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=4
* Re: [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
2026-05-03 20:09 ` sashiko-bot
@ 2026-05-05 19:36 ` Jason Xing
0 siblings, 0 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-05 19:36 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Sun, May 3, 2026 at 11:09 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] In `xsk_build_skb_zerocopy()`, if `xsk_skb_metadata()` fails, the newly allocated `skb` is leaked.
> --
>
> commit 88b75ff1d12b482a55e1724cc457909cf3ea19dd
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
>
> Postpone the skb initialization phase in xsk_build_skb() until the first
> frag allocation is completed to prevent the skb destructor from running
> on error paths. Before the initialization, the skb can be safely freed
> by kfree_skb().
>
> Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 55378c3855d5..af3c5752bb63 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -819,8 +819,6 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > return ERR_PTR(err);
> >
> > skb_reserve(skb, hr);
> > -
> > - xsk_skb_init_misc(skb, xs, desc->addr);
> > if (desc->options & XDP_TX_METADATA) {
> > err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
> > if (unlikely(err))
>
> Could this leak the newly allocated skb if xsk_skb_metadata() fails?
>
> If xsk_skb_metadata() returns an error, xsk_build_skb_zerocopy() will return
> ERR_PTR(err) directly without freeing the skb.
>
> The caller receives the error pointer, sets its local skb variable to NULL,
> and jumps to the free_err label:
>
> net/xdp/xsk.c:xsk_build_skb() {
> ...
> if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> skb = xsk_build_skb_zerocopy(xs, desc);
> if (IS_ERR(skb)) {
> err = PTR_ERR(skb);
> skb = NULL;
> goto free_err;
> }
> ...
> }
>
> Because the skb is explicitly nullified here, the kfree_skb(skb) call in
> the cleanup path is bypassed, leaving the socket buffer and its associated
> sk_wmem_alloc charge leaked.
>
> While this pre-existing issue is explicitly fixed later in this patch series
> in commit "xsk: avoid skb leak in XDP_TX_METADATA case", should it be
> addressed earlier to ensure intermediate states remain safe during a bisect?
No, again, I've mentioned this previously. It's not the same issue
here. No more severe problem is caused by this intermediate state, and
it is fixed within the series.
Thanks,
Jason
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=5
* Re: [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case
2026-05-03 20:09 ` sashiko-bot
@ 2026-05-05 19:43 ` Jason Xing
0 siblings, 0 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-05 19:43 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Sun, May 3, 2026 at 11:09 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [Critical] Time-of-Check to Time-of-Use (TOCTOU) vulnerability on shared UMEM metadata allows bypassing checksum bounds validation.
> - [High] Leaked `xsk_addrs` structure in the non-zerocopy `xsk_build_skb()` error path when page allocation fails.
> --
>
> commit f1079e9399771f267ff5b6767db05016e5ee4840
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: avoid skb leak in XDP_TX_METADATA case
>
> This commit fixes an skb leak in the XDP_TX_METADATA case by explicitly adding
> kfree_skb() before returning back to its caller. The issue happens when
> xsk_skb_metadata() returns an error, leading the caller to clear the skb
> without freeing it.
>
> Closes: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index af3c5752bb63..770ba4695a9d 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -821,8 +821,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > skb_reserve(skb, hr);
> > if (desc->options & XDP_TX_METADATA) {
> > err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
>
> This wasn't introduced by this patch, but is it possible for a shared UMEM
> metadata TOCTOU race to bypass the checksum bounds validation?
>
> In xsk_skb_metadata(), meta points to the metadata area within the umem buffer
> (buffer - pool->tx_metadata_len), which is memory-mapped and concurrently
> writable by userspace. The function performs a bounds check:
>
> if (unlikely(meta->request.csum_start +
> meta->request.csum_offset +
> sizeof(__sum16) > desc->len))
> return -EINVAL;
>
> If this passes, it re-reads the values directly from the shared memory:
>
> skb->csum_start = hr + meta->request.csum_start;
> skb->csum_offset = meta->request.csum_offset;
>
> Could a malicious user application race to overwrite these values with
> arbitrarily large 16-bit integers between the validation check and the
> assignment, leading to out-of-bounds DMA accesses? Might we need to copy
> the metadata to a local variable (e.g., using READ_ONCE()) before validation?
>
> > - if (unlikely(err))
> > + if (unlikely(err)) {
> > + kfree_skb(skb);
> > return ERR_PTR(err);
> > + }
> > }
> > } else {
> > struct xsk_addrs *xsk_addr;
>
> This wasn't introduced by this patch, but does the non-zerocopy
> xsk_build_skb() leak the xsk_addrs structure on its error path?
This issue will be fixed by a standalone patch later :)
>
> When appending a fragment to an existing SKB (!skb is false) in
> xsk_build_skb(), the code converts the destructor_arg from a direct address
> to an allocated xsk_addrs structure and initializes xsk_addr->num_descs = 1.
>
> If the subsequent alloc_page() fails, it jumps to free_err and returns
> -EAGAIN, leaving the partially built SKB in xs->skb without incrementing
> num_descs.
>
> When the socket is closed or the SKB is subsequently dropped via
> xsk_drop_skb(), xsk_consume_skb() is called. Because destructor_arg is now a
> valid pointer, it fetches num_descs as 1. The cleanup logic relies on:
>
> if (unlikely(num_descs > 1))
>
> to free the xsk_addrs structure. Since num_descs is exactly 1, does it skip
> the free, permanently leaking the xsk_tx_generic_cache object?
>
> I saw this issue is addressed later in the patch series (xsk: fix xsk_addrs
> slab leak on multi-buffer error path).
Yep, it is handled by the patch you mentioned.
Thanks,
Jason
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=6
* Re: [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-05-03 20:09 ` sashiko-bot
@ 2026-05-05 19:46 ` Jason Xing
0 siblings, 0 replies; 27+ messages in thread
From: Jason Xing @ 2026-05-05 19:46 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Sun, May 3, 2026 at 11:09 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] In the copy-mode TX path (`xsk_build_skb`), if a multi-buffer fragment exceeds the `MAX_SKB_FRAGS` limit, the buffer's address is leaked to userspace because `0` is erroneously submitted to the completion queue.
> --
>
> commit 4a776aefc60aa4d3123e7e63fd8f927f3748d008
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: fix u64 descriptor address truncation on 32-bit architectures
>
> This commit fixes a 32-bit truncation issue when storing 64-bit descriptor
> addresses in the SKB destructor argument by allocating an xsk_addrs struct.
> It consolidates the allocation logic into xsk_addrs_alloc() and uses it in
> both zerocopy and copy modes.
>
> Link: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 079abd4bcb69..5e5786cd9af5 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
>
> [ ... ]
>
> > @@ -929,19 +959,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > struct page *page;
> > u8 *vaddr;
> >
> > - if (xsk_skb_destructor_is_addr(skb)) {
> > - xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
> > - GFP_KERNEL);
> > - if (!xsk_addr) {
> > - err = -ENOMEM;
> > - goto free_err;
> > - }
> > -
> > - xsk_addr->num_descs = 1;
> > - xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
> > - skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
> > - } else {
> > - xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> > + xsk_addr = xsk_addrs_alloc(skb);
> > + if (!xsk_addr) {
> > + err = -ENOMEM;
> > + goto free_err;
> > }
> >
> > if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
>
> This isn't a regression introduced by this patch, but while looking at the
> modified code, does this error handling path leak the buffer's address to
> userspace?
>
> If this overflow check evaluates to true, the code sets err = -EOVERFLOW and
> jumps to the free_err label before assigning desc->addr to
> xsk_addr->addrs[xsk_addr->num_descs].
>
> In the free_err block, it handles -EOVERFLOW by calling
> xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb). This increments
> num_descs to account for the dropped descriptor and eventually calls
> xsk_destruct_skb() which calls xsk_cq_submit_addr_locked().
>
> Since xsk_addr->addrs was never initialized with desc->addr for this
> descriptor, does this mean it submits 0 (from kmem_cache_zalloc) to the
> completion queue instead of the actual buffer address, causing userspace
> to permanently lose track of the buffer?
>
> By comparison, xsk_build_skb_zerocopy() assigns desc->addr to the array
> before checking the fragment limit:
>
> net/xdp/xsk.c:xsk_build_skb_zerocopy() {
> ...
> xsk_addr = xsk_addrs_alloc(skb);
> if (!xsk_addr)
> return ERR_PTR(-ENOMEM);
>
> /* in case of -EOVERFLOW that could happen below,
> * xsk_consume_skb() will release this node as whole skb
> * would be dropped, which implies freeing all list elements
> */
> xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> }
>
> len = desc->len;
> ...
> }
>
> Should xsk_build_skb() also record the descriptor address before checking
> the -EOVERFLOW condition?
You already uncovered this one, maybe in v4 or v3 :) Sure, it will be
fixed after this patch gets merged.
Thanks,
Jason
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260502200722.53960-1-kerneljasonxing@gmail.com?part=8
* Re: [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (8 preceding siblings ...)
2026-05-05 15:44 ` [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Alexander Lobakin
@ 2026-05-06 2:40 ` patchwork-bot+netdevbpf
9 siblings, 0 replies; 27+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-05-06 2:40 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev, bpf, netdev, kernelxing
Hello:
This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Sat, 2 May 2026 23:07:14 +0300 you wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> There are rare issues around xsk_build_skb(). Some of them
> were found by Sashiko[1][2].
>
> [1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
> [2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/
>
> [...]
Here is the summary with links:
- [net,v5,1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
https://git.kernel.org/netdev/net/c/d73a9a63f9f7
- [net,v5,2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
https://git.kernel.org/netdev/net/c/0bb7a9caf5c1
- [net,v5,3/8] xsk: handle NULL dereference of the skb without frags issue
https://git.kernel.org/netdev/net/c/8cd3c1c6e7d9
- [net,v5,4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
https://git.kernel.org/netdev/net/c/0f3776583d28
- [net,v5,5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
https://git.kernel.org/netdev/net/c/3dec153ae484
- [net,v5,6/8] xsk: avoid skb leak in XDP_TX_METADATA case
https://git.kernel.org/netdev/net/c/8c2cff50afdd
- [net,v5,7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path
https://git.kernel.org/netdev/net/c/e0f229025a8e
- [net,v5,8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
https://git.kernel.org/netdev/net/c/203cee647f55
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
end of thread, other threads:[~2026-05-06 2:41 UTC | newest]
Thread overview: 27+ messages
-- links below jump to the message on this page --
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-05-02 20:07 ` [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:18 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:26 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:28 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:32 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:36 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:43 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
2026-05-02 20:07 ` [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
2026-05-03 20:09 ` sashiko-bot
2026-05-05 19:46 ` Jason Xing
2026-05-04 14:59 ` Stanislav Fomichev
2026-05-05 15:44 ` [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Alexander Lobakin
2026-05-05 19:09 ` Jason Xing
2026-05-06 2:40 ` patchwork-bot+netdevbpf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox