public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation
@ 2026-04-20  8:27 Jason Xing
  2026-04-20  8:27 ` [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-20  8:27 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

There are rare issues around xsk_build_skb(). Some of them
were found by Sashiko [1][2].

[1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
[2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/

---
v2
Link: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/#t
1. add four patches to fix buggy pre-existing behavior spotted by
   Sashiko.
2. adjust the order of the 8 patches.

Jason Xing (8):
  xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
  xsk: handle NULL dereference of the skb without frags issue
  xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
  xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
  xsk: avoid skb leak in XDP_TX_METADATA case
  xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
  xsk: fix xsk_addrs slab leak on multi-buffer error path
  xsk: fix u64 descriptor address truncation on 32-bit architectures

 net/xdp/xsk.c           | 88 +++++++++++++++++++++++++++++++++--------
 net/xdp/xsk_buff_pool.c |  3 ++
 2 files changed, 75 insertions(+), 16 deletions(-)

-- 
2.41.3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
  2026-04-20  8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
@ 2026-04-20  8:27 ` Jason Xing
  2026-04-20 19:34   ` Stanislav Fomichev
  2026-04-21  9:40   ` sashiko-bot
  2026-04-20  8:27 ` [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-20  8:27 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

skb_checksum_help() is a common helper that writes the folded
16-bit checksum back via skb->data + csum_start + csum_offset,
i.e. it relies on the skb's linear head and fails (with WARN_ONCE
and -EINVAL) when skb_headlen() is 0.

AF_XDP generic xmit takes two very different paths depending on the
netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
skip the "copy payload into a linear head" step on purpose as a
performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
pages as frags and never calls skb_put(), so skb_headlen() stays 0
for the whole skb. For these skbs there is simply no linear area for
skb_checksum_help() to write the csum into - the sw-csum fallback is
structurally inapplicable.
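A minimal userspace sketch of that guard (fake_skb and
fake_checksum_help() are hypothetical stand-ins, not kernel code): the
folded checksum is written at csum_start + csum_offset, which must fall
inside the linear head, so the write is impossible when skb_headlen()
is 0.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified model of the skb_checksum_help() guard. */
struct fake_skb {
	unsigned char data[256];
	size_t headlen;		/* bytes in the linear area (skb_headlen()) */
	size_t csum_start;
	size_t csum_offset;
};

static int fake_checksum_help(struct fake_skb *skb, unsigned short folded)
{
	size_t off = skb->csum_start + skb->csum_offset;

	/* Mirrors the WARN_ONCE + -EINVAL path: no linear room to write into. */
	if (off + sizeof(folded) > skb->headlen)
		return -EINVAL;

	skb->data[off] = folded >> 8;
	skb->data[off + 1] = folded & 0xff;
	return 0;
}
```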

Catch this by rejecting the combination at bind() time, which converts
the per-packet failure into a synchronous, actionable -EOPNOTSUPP at
setup. HW csum and
launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
because they do not call skb_checksum_help().

Without the patch, every descriptor carrying 'XDP_TX_METADATA |
XDP_TXMD_FLAGS_CHECKSUM' produces:
1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
2) sendmsg() returning -EINVAL without consuming the descriptor
   (invalid_descs is not incremented),
3) a wedged TX ring: __xsk_generic_xmit() does not advance the
    consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
    the same descriptor and re-hits the same WARN until the socket
    is closed.

Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk_buff_pool.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 37b7a68b89b3..c2521b6547e3 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (force_zc && force_copy)
 		return -EINVAL;
 
+	if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
+		return -EOPNOTSUPP;
+
 	if (xsk_get_pool_from_qid(netdev, queue_id))
 		return -EBUSY;
 
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue
  2026-04-20  8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
  2026-04-20  8:27 ` [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
@ 2026-04-20  8:27 ` Jason Xing
  2026-04-20 19:34   ` Stanislav Fomichev
  2026-04-21  9:40   ` sashiko-bot
  2026-04-20  8:28 ` [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-20  8:27 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When a first descriptor (xs->skb == NULL) triggers -EOVERFLOW in
xsk_build_skb_zerocopy() (e.g. MAX_SKB_FRAGS exceeded), the free_err
EOVERFLOW handler unconditionally dereferences xs->skb via
xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb), causing a NULL
pointer dereference.

In this series, the skb is already freed by kfree_skb() inside
xsk_build_skb_zerocopy for the first-descriptor case, so we only need
to do the bookkeeping: cancel the one reserved CQ slot and account for
the single invalid descriptor.

Guard the existing xsk_inc_num_desc/xsk_drop_skb calls with an
xs->skb check (for the continuation case), and add an else branch
for the first-descriptor case that manually cancels the CQ slot and
increments invalid_descs by one.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6149f6a79897..6521604f8d42 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -893,9 +893,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		kfree_skb(skb);
 
 	if (err == -EOVERFLOW) {
-		/* Drop the packet */
-		xsk_inc_num_desc(xs->skb);
-		xsk_drop_skb(xs->skb);
+		if (xs->skb) {
+			/* Drop the packet */
+			xsk_inc_num_desc(xs->skb);
+			xsk_drop_skb(xs->skb);
+		} else {
+			xsk_cq_cancel_locked(xs->pool, 1);
+			xs->tx->invalid_descs++;
+		}
 		xskq_cons_release(xs->tx);
 	} else {
 		/* Let application retry */
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
  2026-04-20  8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
  2026-04-20  8:27 ` [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
  2026-04-20  8:27 ` [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
@ 2026-04-20  8:28 ` Jason Xing
  2026-04-20 19:34   ` Stanislav Fomichev
  2026-04-21  9:40   ` sashiko-bot
  2026-04-20  8:28 ` [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-20  8:28 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When xsk_build_skb() processes multi-buffer packets in copy mode, the
first descriptor stores data into the skb linear area without adding
any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
to accumulate subsequent descriptors.

If a continuation descriptor fails (e.g. alloc_page returns NULL with
-EAGAIN), we jump to free_err where the condition:

  if (skb && !skb_shinfo(skb)->nr_frags)
      kfree_skb(skb);

evaluates to true because nr_frags is still 0 (the first descriptor
used the linear area, not frags). This frees the skb while xs->skb
still points to it, creating a dangling pointer. On the next transmit
attempt or socket close, xs->skb is dereferenced, causing a
use-after-free or double-free.

Fix by adding a !xs->skb check to the condition, ensuring we only free
skbs that were freshly allocated in this call (xs->skb is NULL) and
never free an in-progress multi-buffer skb that the caller still
references.
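The ownership rule the fix enforces can be sketched as a hypothetical
predicate (should_free_on_err() is illustrative only, not a kernel
helper): free only an skb that this call freshly allocated, i.e. one
the socket does not already track via xs->skb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the free_err decision after the fix: the skb
 * is ours to kfree only if it exists, the caller is not still holding
 * it in xs->skb, and no frags were attached yet. */
static bool should_free_on_err(const void *skb, const void *xs_skb,
			       unsigned int nr_frags)
{
	return skb != NULL && xs_skb == NULL && nr_frags == 0;
}
```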

Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6521604f8d42..4fdd1a45a9bd 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -889,7 +889,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 	return skb;
 
 free_err:
-	if (skb && !skb_shinfo(skb)->nr_frags)
+	if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
 		kfree_skb(skb);
 
 	if (err == -EOVERFLOW) {
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
  2026-04-20  8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (2 preceding siblings ...)
  2026-04-20  8:28 ` [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
@ 2026-04-20  8:28 ` Jason Xing
  2026-04-20 19:34   ` Stanislav Fomichev
  2026-04-20  8:28 ` [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Jason Xing @ 2026-04-20  8:28 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Once xsk_skb_init_misc() has been called on an skb, its destructor is
set to xsk_destruct_skb(), which submits the descriptor address(es) to
the completion queue and advances the CQ producer. If such an skb is
subsequently freed via kfree_skb() along an error path - before the
skb has ever been handed to the driver - the destructor still runs and
submits a bogus, half-initialized address to the CQ.

Introduce a new common helper, xsk_drop_untrans_skb(), which resets the
destructor to sock_wfree() before freeing such an skb. It will be used
by the subsequent patches in this series.

Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4fdd1a45a9bd..614e7bd1252b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -717,6 +717,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
 	return 0;
 }
 
+static void xsk_drop_untrans_skb(struct sk_buff *skb)
+{
+	skb->destructor = sock_wfree;
+	kfree_skb(skb);
+}
+
 static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 					      struct xdp_desc *desc)
 {
@@ -890,7 +896,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 
 free_err:
 	if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
-		kfree_skb(skb);
+		xsk_drop_untrans_skb(skb);
 
 	if (err == -EOVERFLOW) {
 		if (xs->skb) {
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case
  2026-04-20  8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (3 preceding siblings ...)
  2026-04-20  8:28 ` [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
@ 2026-04-20  8:28 ` Jason Xing
  2026-04-21  9:40   ` sashiko-bot
  2026-04-20  8:28 ` [PATCH net v2 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Jason Xing @ 2026-04-20  8:28 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Fix it by calling xsk_drop_untrans_skb() before returning to the
caller.

How to reproduce it in virtio_net:
1. the current skb is the first one (which means no frag and xs->skb is
   NULL) and users enable the metadata feature.
2. xsk_skb_metadata() returns an error code.
3. the caller xsk_build_skb() clears skb by using 'skb = NULL;'.
4. there is no chance to free this skb anymore.

Closes: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 614e7bd1252b..51f76e9d6ffd 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -749,8 +749,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 		xsk_skb_init_misc(skb, xs, desc->addr);
 		if (desc->options & XDP_TX_METADATA) {
 			err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
-			if (unlikely(err))
+			if (unlikely(err)) {
+				xsk_drop_untrans_skb(skb);
 				return ERR_PTR(err);
+			}
 		}
 	} else {
 		struct xsk_addrs *xsk_addr;
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net v2 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
  2026-04-20  8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (4 preceding siblings ...)
  2026-04-20  8:28 ` [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
@ 2026-04-20  8:28 ` Jason Xing
  2026-04-20  8:28 ` [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
  2026-04-20  8:28 ` [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
  7 siblings, 0 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-20  8:28 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Fix it by calling xsk_drop_untrans_skb() before returning to the
caller.

How to reproduce it in virtio_net:
1. the current skb is the first one (which means xs->skb is NULL) and
   hits the MAX_SKB_FRAGS limit.
2. xsk_build_skb_zerocopy() returns -EOVERFLOW.
3. the caller xsk_build_skb() clears skb by using 'skb = NULL;'. This
   is why the bug can be triggered.
4. there is no chance to free this skb anymore.

Note that if in this case the xs->skb is not NULL, xsk_build_skb() will
call xsk_drop_skb(xs->skb) to do the right thing.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 51f76e9d6ffd..9236ec32b54a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -784,8 +784,11 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 	addr = buffer - pool->addrs;
 
 	for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) {
-		if (unlikely(i >= MAX_SKB_FRAGS))
+		if (unlikely(i >= MAX_SKB_FRAGS)) {
+			if (!xs->skb)
+				xsk_drop_untrans_skb(skb);
 			return ERR_PTR(-EOVERFLOW);
+		}
 
 		page = pool->umem->pgs[addr >> PAGE_SHIFT];
 		get_page(page);
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path
  2026-04-20  8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (5 preceding siblings ...)
  2026-04-20  8:28 ` [PATCH net v2 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
@ 2026-04-20  8:28 ` Jason Xing
  2026-04-20 19:58   ` Stanislav Fomichev
  2026-04-20  8:28 ` [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
  7 siblings, 1 reply; 31+ messages in thread
From: Jason Xing @ 2026-04-20  8:28 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When xsk_build_skb() / xsk_build_skb_zerocopy() sees the first
continuation descriptor, it promotes destructor_arg from an inlined
address to a freshly allocated xsk_addrs (num_descs = 1). The counter
is bumped to >= 2 only at the very end of a successful build (by calling
xsk_inc_num_desc()).

If the build fails in between (e.g. alloc_page() returns NULL with
-EAGAIN, or the MAX_SKB_FRAGS overflow hits), we jump to free_err, skip
calling xsk_inc_num_desc() to increment num_descs and leave the half-built
skb attached to xs->skb for the app to retry. The skb now has
1) destructor_arg = a real xsk_addrs pointer,
2) num_descs = 1

If the app never retries and just close()s the socket, xsk_release()
calls xsk_drop_skb() -> xsk_consume_skb(), which decides whether to
free xsk_addrs by testing num_descs > 1:

    if (unlikely(num_descs > 1))
        kmem_cache_free(xsk_tx_generic_cache, destructor_arg);

Because num_descs is exactly 1 the branch is skipped and the
xsk_addrs object is leaked to the xsk_tx_generic_cache slab.

Fix it by directly testing whether destructor_arg still holds an inline
address. If it does not, it has been repurposed to point at memory
newly allocated from xsk_tx_generic_cache, which must be freed
regardless of whether num_descs was ever incremented.
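The encoding being tested can be sketched in userspace (the names below
are hypothetical stand-ins for the kernel helpers): an inline address
carries the 0x1UL tag in bit 0, while a slab-allocated xsk_addrs
pointer keeps bit 0 clear thanks to object alignment, so the tag bit
distinguishes the cases even when num_descs was never bumped past 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of the 64-bit destructor_arg encoding: descriptor
 * addresses are even, so bit 0 is free to use as a tag. */
struct fake_xsk_addrs { uint64_t addrs[17]; };

static void *encode_inline_addr(uintptr_t addr)
{
	return (void *)(addr | 0x1UL);	/* tag bit marks "inline address" */
}

static bool destructor_is_addr(void *destructor_arg)
{
	return (uintptr_t)destructor_arg & 0x1UL;
}
```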

Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 9236ec32b54a..6b17974ca825 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -605,7 +605,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
 	spin_lock_irqsave(&pool->cq_prod_lock, flags);
 	idx = xskq_get_prod(pool->cq);
 
-	if (unlikely(num_descs > 1)) {
+	if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
 		xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
 
 		for (i = 0; i < num_descs; i++) {
@@ -660,7 +660,7 @@ static void xsk_consume_skb(struct sk_buff *skb)
 	u32 num_descs = xsk_get_num_desc(skb);
 	struct xsk_addrs *xsk_addr;
 
-	if (unlikely(num_descs > 1)) {
+	if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
 		xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
 		kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
 	}
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
  2026-04-20  8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (6 preceding siblings ...)
  2026-04-20  8:28 ` [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
@ 2026-04-20  8:28 ` Jason Xing
  2026-04-20 19:49   ` Stanislav Fomichev
  2026-04-21  9:40   ` sashiko-bot
  7 siblings, 2 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-20  8:28 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
uintptr_t cast:

    skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);

On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
mode the chunk offset is encoded in bits 48-63 of the descriptor
address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
lost entirely. The completion queue then returns a truncated address to
userspace, making buffer recycling impossible.
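The truncation can be demonstrated with a small userspace sketch, using
uint32_t as a stand-in for a 32-bit uintptr_t (store_as_32bit() and
OFFSET_SHIFT are illustrative, though the shift value matches
XSK_UNALIGNED_BUF_OFFSET_SHIFT):

```c
#include <stdint.h>

/* In unaligned mode the chunk offset lives in bits 48-63 of the
 * descriptor address, so a cast through a 32-bit pointer type drops
 * it along with the rest of the upper half. */
#define OFFSET_SHIFT 48

static uint64_t store_as_32bit(uint64_t addr)
{
	uint32_t truncated = (uint32_t)addr;	/* what (uintptr_t)addr does on 32-bit */
	return (uint64_t)truncated;
}
```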

Fix this by handling the 32-bit case directly in
xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an xsk_addrs
struct (the same path already used for multi-descriptor SKBs) to store
the full u64 address.

Extend xsk_drop_untrans_skb() to free the xsk_addrs allocation on
32-bit when the skb is dropped before transmission, and extend
xsk_skb_destructor_is_addr() to cover the 32-bit case: the 0x1UL tag
bit cannot be used there, so a NULL destructor_arg is treated as "no
xsk_addrs to handle".

The overhead is one extra kmem_cache_zalloc per first descriptor on
32-bit only; 64-bit builds are completely unchanged.

Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 47 insertions(+), 7 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6b17974ca825..bd49dbd9875b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -556,9 +556,23 @@ static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool)
 	return ret;
 }
 
+/*
+ * On 64-bit, destructor_arg can store an inline address directly
+ * (tagged with bit 0 set). On 32-bit, all addresses go through an
+ * allocated xsk_addrs struct instead. In that case this function
+ * returns true only when destructor_arg is NULL (set_addr has not
+ * yet been called or has failed).
+ *
+ * For all callers:
+ *   return true: no xsk_addrs struct to handle
+ *   return false: destructor_arg points to an xsk_addrs struct
+ */
 static bool xsk_skb_destructor_is_addr(struct sk_buff *skb)
 {
-	return (uintptr_t)skb_shinfo(skb)->destructor_arg & 0x1UL;
+	if (IS_ENABLED(CONFIG_64BIT))
+		return (uintptr_t)skb_shinfo(skb)->destructor_arg & 0x1UL;
+	else
+		return !skb_shinfo(skb)->destructor_arg;
 }
 
 static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
@@ -566,9 +580,21 @@ static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
 	return (u64)((uintptr_t)skb_shinfo(skb)->destructor_arg & ~0x1UL);
 }
 
-static void xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
+static int xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
 {
+	if (!IS_ENABLED(CONFIG_64BIT)) {
+		struct xsk_addrs *xsk_addr;
+
+		xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
+		if (!xsk_addr)
+			return -ENOMEM;
+		xsk_addr->addrs[0] = addr;
+		skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
+		return 0;
+	}
+
 	skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
+	return 0;
 }
 
 static void xsk_inc_num_desc(struct sk_buff *skb)
@@ -644,14 +670,14 @@ void xsk_destruct_skb(struct sk_buff *skb)
 	sock_wfree(skb);
 }
 
-static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
-			      u64 addr)
+static int xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
+			     u64 addr)
 {
 	skb->dev = xs->dev;
 	skb->priority = READ_ONCE(xs->sk.sk_priority);
 	skb->mark = READ_ONCE(xs->sk.sk_mark);
 	skb->destructor = xsk_destruct_skb;
-	xsk_skb_destructor_set_addr(skb, addr);
+	return xsk_skb_destructor_set_addr(skb, addr);
 }
 
 static void xsk_consume_skb(struct sk_buff *skb)
@@ -719,6 +745,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
 
 static void xsk_drop_untrans_skb(struct sk_buff *skb)
 {
+	if (!IS_ENABLED(CONFIG_64BIT) && !xsk_skb_destructor_is_addr(skb)) {
+		struct xsk_addrs *xsk_addr;
+
+		xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+		kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
+	}
 	skb->destructor = sock_wfree;
 	kfree_skb(skb);
 }
@@ -746,7 +778,12 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 
 		skb_reserve(skb, hr);
 
-		xsk_skb_init_misc(skb, xs, desc->addr);
+		err = xsk_skb_init_misc(skb, xs, desc->addr);
+		if (unlikely(err)) {
+			xsk_drop_untrans_skb(skb);
+			return ERR_PTR(err);
+		}
+
 		if (desc->options & XDP_TX_METADATA) {
 			err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
 			if (unlikely(err)) {
@@ -845,7 +882,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			if (unlikely(err))
 				goto free_err;
 
-			xsk_skb_init_misc(skb, xs, desc->addr);
+			err = xsk_skb_init_misc(skb, xs, desc->addr);
+			if (unlikely(err))
+				goto free_err;
+
 			if (desc->options & XDP_TX_METADATA) {
 				err = xsk_skb_metadata(skb, buffer, desc,
 						       xs->pool, hr);
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
  2026-04-20  8:27 ` [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
@ 2026-04-20 19:34   ` Stanislav Fomichev
  2026-04-20 23:51     ` Jason Xing
  2026-04-21  9:40   ` sashiko-bot
  1 sibling, 1 reply; 31+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:34 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf, netdev, Jason Xing

> From: Jason Xing <kernelxing@tencent.com>
> 
> skb_checksum_help() is a common helper that writes the folded
> 16-bit checksum back via skb->data + csum_start + csum_offset,
> i.e. it relies on the skb's linear head and fails (with WARN_ONCE
> and -EINVAL) when skb_headlen() is 0.
> 
> AF_XDP generic xmit takes two very different paths depending on the
> netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
> skip the "copy payload into a linear head" step on purpose as a
> performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
> pages as frags and never calls skb_put(), so skb_headlen() stays 0
> for the whole skb. For these skbs there is simply no linear area for
> skb_checksum_help() to write the csum into - the sw-csum fallback is
> structurally inapplicable.
> 
> The patch tries to catch this and reject the combination with error at
> setup time. Rejecting at bind() converts this silent per-packet failure
> into a synchronous, actionable -EOPNOTSUPP at setup time. HW csum and
> launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
> because they do not call skb_checksum_help().
> 
> Without the patch, every descriptor carrying 'XDP_TX_METADATA |
> XDP_TXMD_FLAGS_CHECKSUM' produces:
> 1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
> 2) sendmsg() returning -EINVAL without consuming the descriptor
>    (invalid_descs is not incremented),
> 3) a wedged TX ring: __xsk_generic_xmit() does not advance the
>     consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
>     the same descriptor and re-hits the same WARN until the socket
>     is closed.
> 
> Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
> Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/xdp/xsk_buff_pool.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 37b7a68b89b3..c2521b6547e3 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  	if (force_zc && force_copy)
>  		return -EINVAL;
>  
> +	if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> +		return -EOPNOTSUPP;
> +
>  	if (xsk_get_pool_from_qid(netdev, queue_id))
>  		return -EBUSY;
>  
> -- 
> 2.41.3
>

Wondering whether a better fixes tag is commit 11614723af26 ("xsk: Add option
to calculate TX checksum in SW")?

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue
  2026-04-20  8:27 ` [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
@ 2026-04-20 19:34   ` Stanislav Fomichev
  2026-04-21  9:40   ` sashiko-bot
  1 sibling, 0 replies; 31+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:34 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf, netdev, Jason Xing

> From: Jason Xing <kernelxing@tencent.com>
> 
> When a first descriptor (xs->skb == NULL) triggers -EOVERFLOW in
> xsk_build_skb_zerocopy (e.g., MAX_SKB_FRAGS exceeded), the free_err
> EOVERFLOW handler unconditionally dereferences xs->skb via
> xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb), causing a NULL
> pointer dereference.
> 
> In this series, the skb is already freed by kfree_skb() inside
> xsk_build_skb_zerocopy for the first-descriptor case, so we only need
> to do the bookkeeping: cancel the one reserved CQ slot and account for
> the single invalid descriptor.
> 
> Guard the existing xsk_inc_num_desc/xsk_drop_skb calls with an
> xs->skb check (for the continuation case), and add an else branch
> for the first-descriptor case that manually cancels the CQ slot and
> increments invalid_descs by one.
> 
> Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/xdp/xsk.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 6149f6a79897..6521604f8d42 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -893,9 +893,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  		kfree_skb(skb);
>  
>  	if (err == -EOVERFLOW) {
> -		/* Drop the packet */
> -		xsk_inc_num_desc(xs->skb);
> -		xsk_drop_skb(xs->skb);
> +		if (xs->skb) {
> +			/* Drop the packet */
> +			xsk_inc_num_desc(xs->skb);
> +			xsk_drop_skb(xs->skb);
> +		} else {
> +			xsk_cq_cancel_locked(xs->pool, 1);
> +			xs->tx->invalid_descs++;
> +		}
>  		xskq_cons_release(xs->tx);
>  	} else {
>  		/* Let application retry */
> -- 
> 2.41.3
>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
  2026-04-20  8:28 ` [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
@ 2026-04-20 19:34   ` Stanislav Fomichev
  2026-04-21  0:01     ` Jason Xing
  2026-04-21  9:40   ` sashiko-bot
  1 sibling, 1 reply; 31+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:34 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf, netdev, Jason Xing

> From: Jason Xing <kernelxing@tencent.com>
> 
> When xsk_build_skb() processes multi-buffer packets in copy mode, the
> first descriptor stores data into the skb linear area without adding
> any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
> to accumulate subsequent descriptors.
> 
> If a continuation descriptor fails (e.g. alloc_page returns NULL with
> -EAGAIN), we jump to free_err where the condition:
> 
>   if (skb && !skb_shinfo(skb)->nr_frags)
>       kfree_skb(skb);
> 
> evaluates to true because nr_frags is still 0 (the first descriptor
> used the linear area, not frags). This frees the skb while xs->skb
> still points to it, creating a dangling pointer. On the next transmit
> attempt or socket close, xs->skb is dereferenced, causing a
> use-after-free or double-free.
> 
> Fix by adding a !xs->skb check to the condition, ensuring we only free
> skbs that were freshly allocated in this call (xs->skb is NULL) and
> never free an in-progress multi-buffer skb that the caller still
> references.
> 
> Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
> Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/xdp/xsk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 6521604f8d42..4fdd1a45a9bd 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -889,7 +889,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  	return skb;
>  
>  free_err:
> -	if (skb && !skb_shinfo(skb)->nr_frags)
> +	if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
>  		kfree_skb(skb);
>  
>  	if (err == -EOVERFLOW) {
> -- 
> 2.41.3

Now "!skb_shinfo(skb)->nr_frags" feels redundant? It's either
"skb && !xs->skb" and we own the kfree. or "xs->skb != NULL" and we
want xsk_drop_skb? Or am I missing something?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
  2026-04-20  8:28 ` [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
@ 2026-04-20 19:34   ` Stanislav Fomichev
  2026-04-21  0:51     ` Jason Xing
  0 siblings, 1 reply; 31+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:34 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf, netdev, Jason Xing

> From: Jason Xing <kernelxing@tencent.com>
> 
> Once xsk_skb_init_misc() has been called on an skb, its destructor is
> set to xsk_destruct_skb(), which submits the descriptor address(es) to
> the completion queue and advances the CQ producer. If such an skb is
> subsequently freed via kfree_skb() along an error path - before the
> skb has ever been handed to the driver - the destructor still runs and
> submits a bogus, half-initialized address to the CQ.
> 
> Introduce a new common helper to fix the issue. That function will be
> used by the subsequent patches soon.
> 
> Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
> Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/xdp/xsk.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 4fdd1a45a9bd..614e7bd1252b 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -717,6 +717,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
>  	return 0;
>  }
>  
> +static void xsk_drop_untrans_skb(struct sk_buff *skb)
> +{
> +	skb->destructor = sock_wfree;
> +	kfree_skb(skb);
> +}
> +
>  static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
>  					      struct xdp_desc *desc)
>  {
> @@ -890,7 +896,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  
>  free_err:
>  	if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
> -		kfree_skb(skb);
> +		xsk_drop_untrans_skb(skb);
>  
>  	if (err == -EOVERFLOW) {
>  		if (xs->skb) {
> -- 
> 2.41.3
>

Have you considered the alternative where we postpone `skb->destructor =
xsk_destruct_skb` to a later point? Will this be less messy than
undoing that destructor assignment in a few curated places?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
  2026-04-20  8:28 ` [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
@ 2026-04-20 19:49   ` Stanislav Fomichev
  2026-04-21  0:49     ` Jason Xing
  2026-04-21  9:40   ` sashiko-bot
  1 sibling, 1 reply; 31+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:49 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, bpf, netdev, Jason Xing

On 04/20, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
> descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
> uintptr_t cast:
> 
>     skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
> 
> On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
> the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
> mode the chunk offset is encoded in bits 48-63 of the descriptor
> address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
> lost entirely. The completion queue then returns a truncated address to
> userspace, making buffer recycling impossible.
> 
> Fix this by handling the 32-bit case directly in
> xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an xsk_addrs
> struct (the same path already used for multi-descriptor SKBs) to store
> the full u64 address.

Is it easier to make XSK `depends on 64BIT` to avoid dealing with that? Does
anybody seriously run af_xdp on 32 bit systems?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path
  2026-04-20  8:28 ` [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
@ 2026-04-20 19:58   ` Stanislav Fomichev
  0 siblings, 0 replies; 31+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:58 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, bpf, netdev, Jason Xing

On 04/20, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> When xsk_build_skb() / xsk_build_skb_zerocopy() sees the first
> continuation descriptor, it promotes destructor_arg from an inlined
> address to a freshly allocated xsk_addrs (num_descs = 1). The counter
> is bumped to >= 2 only at the very end of a successful build (by calling
> xsk_inc_num_desc()).
> 
> If the build fails in between (e.g. alloc_page() returns NULL with
> -EAGAIN, or the MAX_SKB_FRAGS overflow hits), we jump to free_err, skip
> calling xsk_inc_num_desc() to increment num_descs and leave the half-built
> skb attached to xs->skb for the app to retry. The skb now has
> 1) destructor_arg = a real xsk_addrs pointer,
> 2) num_descs = 1
> 
> If the app never retries and just close()s the socket, xsk_release()
> calls xsk_drop_skb() -> xsk_consume_skb(), which decides whether to
> free xsk_addrs by testing num_descs > 1:
> 
>     if (unlikely(num_descs > 1))
>         kmem_cache_free(xsk_tx_generic_cache, destructor_arg);
> 
> Because num_descs is exactly 1 the branch is skipped and the
> xsk_addrs object is leaked to the xsk_tx_generic_cache slab.
> 
> Fix it by directly testing whether destructor_arg still holds the
> inlined addr. Otherwise it has been repointed to memory newly
> allocated from xsk_tx_generic_cache, which must be freed regardless
> of whether num_descs was incremented.
> 
> Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
> Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/xdp/xsk.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 9236ec32b54a..6b17974ca825 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -605,7 +605,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
>  	spin_lock_irqsave(&pool->cq_prod_lock, flags);
>  	idx = xskq_get_prod(pool->cq);
>  
> -	if (unlikely(num_descs > 1)) {
> +	if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
>  		xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
>  
>  		for (i = 0; i < num_descs; i++) {
> @@ -660,7 +660,7 @@ static void xsk_consume_skb(struct sk_buff *skb)
>  	u32 num_descs = xsk_get_num_desc(skb);
>  	struct xsk_addrs *xsk_addr;
>  
> -	if (unlikely(num_descs > 1)) {
> +	if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
>  		xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
>  		kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
>  	}
> -- 
> 2.41.3
> 

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
  2026-04-20 19:34   ` Stanislav Fomichev
@ 2026-04-20 23:51     ` Jason Xing
  2026-04-21 22:20       ` Stanislav Fomichev
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Xing @ 2026-04-20 23:51 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, netdev, Jason Xing

On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > skb_checksum_help() is a common helper that writes the folded
> > 16-bit checksum back via skb->data + csum_start + csum_offset,
> > i.e. it relies on the skb's linear head and fails (with WARN_ONCE
> > and -EINVAL) when skb_headlen() is 0.
> >
> > AF_XDP generic xmit takes two very different paths depending on the
> > netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
> > skip the "copy payload into a linear head" step on purpose as a
> > performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
> > pages as frags and never calls skb_put(), so skb_headlen() stays 0
> > for the whole skb. For these skbs there is simply no linear area for
> > skb_checksum_help() to write the csum into - the sw-csum fallback is
> > structurally inapplicable.
> >
> > The patch tries to catch this and reject the combination with error at
> > setup time. Rejecting at bind() converts this silent per-packet failure
> > into a synchronous, actionable -EOPNOTSUPP at setup time. HW csum and
> > launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
> > because they do not call skb_checksum_help().
> >
> > Without the patch, every descriptor carrying 'XDP_TX_METADATA |
> > XDP_TXMD_FLAGS_CHECKSUM' produces:
> > 1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
> > 2) sendmsg() returning -EINVAL without consuming the descriptor
> >    (invalid_descs is not incremented),
> > 3) a wedged TX ring: __xsk_generic_xmit() does not advance the
> >     consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
> >     the same descriptor and re-hits the same WARN until the socket
> >     is closed.
> >
> > Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
> > Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  net/xdp/xsk_buff_pool.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index 37b7a68b89b3..c2521b6547e3 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> >       if (force_zc && force_copy)
> >               return -EINVAL;
> >
> > +     if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> > +             return -EOPNOTSUPP;
> > +
> >       if (xsk_get_pool_from_qid(netdev, queue_id))
> >               return -EBUSY;
> >
> > --
> > 2.41.3
> >
>
> Wondering whether a better fixes tag is commit 11614723af26 ("xsk: Add option
> to calculate TX checksum in SW")?
>
> Acked-by: Stanislav Fomichev <sdf@fomichev.me>

Thanks for the check, but not really. It is commit 30c3055f9c0d
that introduced csum support for the IFF_TX_SKB_NO_LINEAR case where
this issue can be triggered (because that mode no longer puts data
into the skb linear area).

Thanks,
Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
  2026-04-20 19:34   ` Stanislav Fomichev
@ 2026-04-21  0:01     ` Jason Xing
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-21  0:01 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, netdev, Jason Xing

On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > When xsk_build_skb() processes multi-buffer packets in copy mode, the
> > first descriptor stores data into the skb linear area without adding
> > any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
> > to accumulate subsequent descriptors.
> >
> > If a continuation descriptor fails (e.g. alloc_page returns NULL with
> > -EAGAIN), we jump to free_err where the condition:
> >
> >   if (skb && !skb_shinfo(skb)->nr_frags)
> >       kfree_skb(skb);
> >
> > evaluates to true because nr_frags is still 0 (the first descriptor
> > used the linear area, not frags). This frees the skb while xs->skb
> > still points to it, creating a dangling pointer. On the next transmit
> > attempt or socket close, xs->skb is dereferenced, causing a
> > use-after-free or double-free.
> >
> > Fix by adding a !xs->skb check to the condition, ensuring we only free
> > skbs that were freshly allocated in this call (xs->skb is NULL) and
> > never free an in-progress multi-buffer skb that the caller still
> > references.
> >
> > Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
> > Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  net/xdp/xsk.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 6521604f8d42..4fdd1a45a9bd 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -889,7 +889,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >       return skb;
> >
> >  free_err:
> > -     if (skb && !skb_shinfo(skb)->nr_frags)
> > +     if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
> >               kfree_skb(skb);
> >
> >       if (err == -EOVERFLOW) {
> > --
> > 2.41.3
>
> Now "!skb_shinfo(skb)->nr_frags" feels redundant? It's either
> "skb && !xs->skb" and we own the kfree. or "xs->skb != NULL" and we
> want xsk_drop_skb? Or am I missing something?

Your feeling about it being redundant is right. I'm removing it now :)

At this stage, the job of this if statement is to identify the first
skb, so !xs->skb is a clear indicator, as you said.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
  2026-04-20 19:49   ` Stanislav Fomichev
@ 2026-04-21  0:49     ` Jason Xing
  2026-04-21 22:23       ` Stanislav Fomichev
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Xing @ 2026-04-21  0:49 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, bpf, netdev, Jason Xing

On Tue, Apr 21, 2026 at 3:49 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> On 04/20, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
> > descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
> > uintptr_t cast:
> >
> >     skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
> >
> > On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
> > the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
> > mode the chunk offset is encoded in bits 48-63 of the descriptor
> > address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
> > lost entirely. The completion queue then returns a truncated address to
> > userspace, making buffer recycling impossible.
> >
> > Fix this by handling the 32-bit case directly in
> > xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an xsk_addrs
> > struct (the same path already used for multi-descriptor SKBs) to store
> > the full u64 address.
>
> Is it easier to make XSK `depends on 64BIT` to avoid dealing with that? Does

Of course, that would be super easy. Actually the initial version
looked like this; one line of code is simply enough.

> anybody seriously run af_xdp on 32 bit systems?

But my worry, as you guessed, is whether any 32-bit systems still
exist. I doubt it. That's why I put some effort into adding the
compatibility code to cover the case. The good news is that it has no
side effects on 64-bit systems, since the changes are protected by an
IS_ENABLED() condition.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
  2026-04-20 19:34   ` Stanislav Fomichev
@ 2026-04-21  0:51     ` Jason Xing
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-21  0:51 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, netdev, Jason Xing

On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Once xsk_skb_init_misc() has been called on an skb, its destructor is
> > set to xsk_destruct_skb(), which submits the descriptor address(es) to
> > the completion queue and advances the CQ producer. If such an skb is
> > subsequently freed via kfree_skb() along an error path - before the
> > skb has ever been handed to the driver - the destructor still runs and
> > submits a bogus, half-initialized address to the CQ.
> >
> > Introduce a new common helper to fix the issue. That function will be
> > used by the subsequent patches soon.
> >
> > Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
> > Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  net/xdp/xsk.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 4fdd1a45a9bd..614e7bd1252b 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -717,6 +717,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> >       return 0;
> >  }
> >
> > +static void xsk_drop_untrans_skb(struct sk_buff *skb)
> > +{
> > +     skb->destructor = sock_wfree;
> > +     kfree_skb(skb);
> > +}
> > +
> >  static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> >                                             struct xdp_desc *desc)
> >  {
> > @@ -890,7 +896,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >
> >  free_err:
> >       if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
> > -             kfree_skb(skb);
> > +             xsk_drop_untrans_skb(skb);
> >
> >       if (err == -EOVERFLOW) {
> >               if (xs->skb) {
> > --
> > 2.41.3
> >
>
> Have you considered the alternative where we postpone `skb->destructor =
> xsk_destruct_skb` to a later point? Will this be less messy than
> undoing that descriptor in a few curated places?

Thanks. It does make sense. Adding the following code seems fine:
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 887abed25466..02e723ca3080 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -820,7 +820,6 @@ static struct sk_buff
*xsk_build_skb_zerocopy(struct xdp_sock *xs,

                skb_reserve(skb, hr);

-               xsk_skb_init_misc(skb, xs, desc->addr);
                if (desc->options & XDP_TX_METADATA) {
                        err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
                        if (unlikely(err))
@@ -914,7 +913,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
                        if (unlikely(err))
                                goto free_err;

-                       xsk_skb_init_misc(skb, xs, desc->addr);
                        if (desc->options & XDP_TX_METADATA) {
                                err = xsk_skb_metadata(skb, buffer, desc,
                                                       xs->pool, hr);
@@ -964,6 +962,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
                }
        }

+       if (!xs->skb)
+               xsk_skb_init_misc(skb, xs, desc->addr);
        xsk_inc_num_desc(skb);

        return skb;

Thanks,
Jason

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
  2026-04-20  8:27 ` [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
  2026-04-20 19:34   ` Stanislav Fomichev
@ 2026-04-21  9:40   ` sashiko-bot
  2026-04-21 12:39     ` Jason Xing
  1 sibling, 1 reply; 31+ messages in thread
From: sashiko-bot @ 2026-04-21  9:40 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 1 potential issue(s):
- [High] The patch unconditionally rejects pool assignment for devices with IFF_TX_SKB_NO_LINEAR if tx_sw_csum is enabled, which incorrectly breaks native Zero-Copy (ZC) setups on those devices.
--

commit 5beba1aaed8c200424f98af3c07472ab4da8d451
Author: Jason Xing <kerneljasonxing@gmail.com>

xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices

This commit rejects UMEM bindings for devices that advertise
IFF_TX_SKB_NO_LINEAR when software checksumming (tx_sw_csum) is enabled.
This prevents a warning and panic in skb_checksum_help() which relies on a
linear skb head.

Link: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t

> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index cd7bc50872f6b..d981cfdd85357 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -175,6 +175,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  	if (force_zc && force_copy)
>  		return -EINVAL;
>  
> +	if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> +		return -EOPNOTSUPP;
> +

Does this unconditionally reject the pool assignment even for native
Zero-Copy (ZC) setups?

In native XDP_ZEROCOPY mode, AF_XDP descriptors are sent directly to the
hardware. No sk_buff is allocated, so skb_checksum_help() is never called,
making the tx_sw_csum flag irrelevant and safe to ignore.

By placing this check before determining if native ZC is available, does
this break ZC mode on drivers like virtio_net for applications that pass
the XDP_UMEM_TX_SW_CSUM flag as a fallback?

Should this check be deferred until we are certain we are using generic
copy mode, such as when force_copy is true or after native ZC setup fails?

>  	if (xsk_get_pool_from_qid(netdev, queue_id))
>  		return -EBUSY;
>

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=1

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue
  2026-04-20  8:27 ` [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
  2026-04-20 19:34   ` Stanislav Fomichev
@ 2026-04-21  9:40   ` sashiko-bot
  2026-04-21 12:46     ` Jason Xing
  1 sibling, 1 reply; 31+ messages in thread
From: sashiko-bot @ 2026-04-21  9:40 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 1 potential issue(s):
- [High] Memory leak of newly allocated skb in `xsk_build_skb_zerocopy()` on error paths for the first descriptor.
--

commit c0083470d60454acdbd801635be8f00a9b79a9eb
Author: Jason Xing <kerneljasonxing@gmail.com>

xsk: handle NULL dereference of the skb without frags issue

This commit addresses a NULL pointer dereference in the error handling
path of xsk_build_skb by explicitly checking if xs->skb is NULL. For the
first-descriptor case, it cancels the CQ slot and tracks the invalid
descriptor instead of dropping the skb.

> In this series, the skb is already freed by kfree_skb() inside
> xsk_build_skb_zerocopy for the first-descriptor case, so we only need
> to do the bookkeeping: cancel the one reserved CQ slot and account for
> the single invalid descriptor.

Is this description accurate regarding the skb being freed?

Looking at xsk_build_skb(), if xsk_build_skb_zerocopy() encounters an
error and returns an error pointer, the local skb variable is set to
NULL before jumping to the free_err label:

    skb = xsk_build_skb_zerocopy(xs, desc);
    if (IS_ERR(skb)) {
        err = PTR_ERR(skb);
        skb = NULL;
        goto free_err;
    }

Since skb is NULL, the cleanup code at free_err is bypassed:

free_err:
    if (skb && !skb_shinfo(skb)->nr_frags)
        kfree_skb(skb);

Does this code leak the newly allocated skb and its sk_wmem_alloc charge
when xsk_build_skb_zerocopy() fails on the first descriptor?

I see this is addressed later in the series by commit aef1b3344bd95
("xsk: avoid skb leak in XDP_TX_METADATA case") and commit 93315725e1616
("xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS"), but
the commit message here appears to incorrectly assume the free is already
happening.

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 887abed254668..54d4895fc11eb 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c

[ ... ]

> @@ -973,9 +973,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  		kfree_skb(skb);
>  
>  	if (err == -EOVERFLOW) {
> -		/* Drop the packet */
> -		xsk_inc_num_desc(xs->skb);
> -		xsk_drop_skb(xs->skb);
> +		if (xs->skb) {
> +			/* Drop the packet */
> +			xsk_inc_num_desc(xs->skb);
> +			xsk_drop_skb(xs->skb);
> +		} else {
> +			xsk_cq_cancel_locked(xs->pool, 1);
> +			xs->tx->invalid_descs++;
> +		}

Because the skb wasn't freed as the commit message assumed, can a
malicious application exploit this missing kfree_skb by repeatedly
supplying abnormally large descriptors to exhaust socket buffer memory?

>  		xskq_cons_release(xs->tx);
>  	} else {
>  		/* Let application retry */

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=2

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
  2026-04-20  8:28 ` [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
  2026-04-20 19:34   ` Stanislav Fomichev
@ 2026-04-21  9:40   ` sashiko-bot
  2026-04-21 12:51     ` Jason Xing
  1 sibling, 1 reply; 31+ messages in thread
From: sashiko-bot @ 2026-04-21  9:40 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 1 potential issue(s):
- [Critical] Dangling `xs->skb` pointer on `-EOVERFLOW` error path leading to Use-After-Free.
--

commit 4ce0cecf5e8e18bd369c02242f70a42cc8fbdf03
Author: Jason Xing <kerneljasonxing@gmail.com>

xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path

This commit aims to fix a use-after-free in the xsk_build_skb() error
handling path by ensuring we only free newly allocated skbs when a
continuation descriptor fails. It prevents freeing an in-progress
multi-buffer skb by adding a check for !xs->skb before kfree_skb(skb).

Link: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -969,7 +969,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
[ ... ]
>  free_err:
> -	if (skb && !skb_shinfo(skb)->nr_frags)
> +	if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
>  		kfree_skb(skb);
>  
>  	if (err == -EOVERFLOW) {

Does this patch leave a use-after-free on the -EOVERFLOW error path?

When xsk_build_skb() handles a continuation descriptor and encounters an
-EOVERFLOW error, it jumps to free_err and enters this block. Looking at the
subsequent code in xsk_build_skb():

	if (err == -EOVERFLOW) {
		if (xs->skb) {
			/* Drop the packet */
			xsk_inc_num_desc(xs->skb);
			xsk_drop_skb(xs->skb);
		} else {

The xsk_drop_skb(xs->skb) call consumes and frees the socket buffer, but
xs->skb is never set to NULL.

Because xsk_build_skb() returns ERR_PTR(-EOVERFLOW), the caller
__xsk_generic_xmit() intercepts this error and continues its transmission
loop without aborting.

On the next iteration, __xsk_generic_xmit() would pass the dangling xs->skb
pointer back into xsk_build_skb(), which will dereference the freed memory
when checking skb_shinfo(skb)->nr_frags.

Should xs->skb be cleared to NULL after calling xsk_drop_skb() here?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=3

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case
  2026-04-20  8:28 ` [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
@ 2026-04-21  9:40   ` sashiko-bot
  2026-04-21 12:58     ` Jason Xing
  0 siblings, 1 reply; 31+ messages in thread
From: sashiko-bot @ 2026-04-21  9:40 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 2 potential issue(s):
- [Critical] Time-of-Check to Time-of-Use (TOCTOU) vulnerability in `xsk_skb_metadata` allows unprivileged users to bypass bounds checks and cause out-of-bounds memory accesses.
- [High] Software checksumming calculation fails completely because `skb_checksum_help()` is invoked on an incomplete and non-linear SKB.
--

commit aef1b3344bd952f39b1181e046ab7d9ce77fbc8a
Author: Jason Xing <kerneljasonxing@gmail.com>

xsk: avoid skb leak in XDP_TX_METADATA case

This commit explicitly adds a call to kfree_skb() (via
xsk_drop_untrans_skb) when xsk_skb_metadata() returns an error, preventing
an skb leak if metadata processing fails.

Link: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 55b448abb8178..fd5432fb807e1 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -829,8 +829,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
>  		xsk_skb_init_misc(skb, xs, desc->addr);
>  		if (desc->options & XDP_TX_METADATA) {
>  			err = xsk_skb_metadata(skb, buffer, desc, pool, hr);

While looking at this metadata handling, I noticed a couple of pre-existing
regressions.

Will skb_checksum_help() always fail when called here?

xsk_skb_metadata() is called right after allocating the SKB but before any
payload fragments are appended via skb_fill_page_desc() further down in
xsk_build_skb_zerocopy(). This means skb->len and skb_headlen(skb) are 0.

If software checksumming is enabled, xsk_skb_metadata() calls
skb_checksum_help(skb). Since skb_headlen(skb) is 0, won't the check
offset >= skb_headlen(skb) in skb_checksum_help() always evaluate to true,
triggering a warning and causing the packet to be dropped?

Also, can a user-space program modify the checksum metadata concurrently?

In xsk_skb_metadata(), the metadata is read directly from the UMEM buffer,
which is shared memory mapped into user-space:

net/xdp/xsk.c:xsk_skb_metadata() {
    ...
	if (meta->flags & XDP_TXMD_FLAGS_CHECKSUM) {
		if (unlikely(meta->request.csum_start +
			     meta->request.csum_offset +
			     sizeof(__sum16) > desc->len))
			return -EINVAL;

		skb->csum_start = hr + meta->request.csum_start;
		skb->csum_offset = meta->request.csum_offset;
    ...
}

The bounds check does not use READ_ONCE(). Could a user-space application
modify meta->request.csum_start and meta->request.csum_offset between the
check and the assignment? If so, it might bypass the bounds check and specify
arbitrarily large offsets, causing out-of-bounds writes when the hardware
inserts the checksum.

> -			if (unlikely(err))
> +			if (unlikely(err)) {
> +				xsk_drop_untrans_skb(skb);
>  				return ERR_PTR(err);
> +			}
>  		}
>  	} else {
>  		struct xsk_addrs *xsk_addr;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
  2026-04-20  8:28 ` [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
  2026-04-20 19:49   ` Stanislav Fomichev
@ 2026-04-21  9:40   ` sashiko-bot
  2026-04-21 13:01     ` Jason Xing
  1 sibling, 1 reply; 31+ messages in thread
From: sashiko-bot @ 2026-04-21  9:40 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf

Sashiko AI review found 1 potential issue(s):
- [High] On 32-bit systems, xsk_drop_untrans_skb() will call kmem_cache_free() on an uninitialized destructor_arg pointer in early allocation failure paths, causing slab corruption.
--

commit cf643670755512ad76d19d6d0707e86a2a05c671
Author: Jason Xing <kerneljasonxing@gmail.com>

xsk: fix u64 descriptor address truncation on 32-bit architectures

This commit fixes a truncation issue of the 64-bit descriptor address
on 32-bit architectures by allocating an xsk_addrs struct to store the
full address instead of packing it into the destructor_arg pointer.
It also updates the related cleanup and initialization functions to
handle the newly allocated struct.

Link: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 9c06a4e868725..0232559d5348f 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
[ ... ]
> @@ -799,6 +825,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
>  
>  static void xsk_drop_untrans_skb(struct sk_buff *skb)
>  {
> +	if (!IS_ENABLED(CONFIG_64BIT) && !xsk_skb_destructor_is_addr(skb)) {
> +		struct xsk_addrs *xsk_addr;
> +
> +		xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> +		kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
> +	}

Could this pass an uninitialized pointer to kmem_cache_free() and cause slab
corruption?

When a new socket buffer is allocated via sock_alloc_send_skb(), the
networking core's __alloc_skb() function only zeroes the skb_shared_info
struct up to the dataref field. The destructor_arg field resides after
dataref and contains uninitialized garbage from the slab.

If an error occurs before destructor_arg is explicitly initialized, such as
when xsk_skb_init_misc() returns -ENOMEM:

xsk_build_skb_zerocopy()
    ...
    skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
    ...
    err = xsk_skb_init_misc(skb, xs, desc->addr);
    if (unlikely(err)) {
        xsk_drop_untrans_skb(skb);
        return ERR_PTR(err);
    }

In xsk_drop_untrans_skb() on 32-bit systems, xsk_skb_destructor_is_addr(skb)
evaluates to !skb_shinfo(skb)->destructor_arg. Because destructor_arg
contains non-zero garbage, the function returns false. The check in
xsk_drop_untrans_skb() (!xsk_skb_destructor_is_addr(skb)) then succeeds,
passing the garbage pointer to kmem_cache_free().

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=8

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
  2026-04-21  9:40   ` sashiko-bot
@ 2026-04-21 12:39     ` Jason Xing
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-21 12:39 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Tue, Apr 21, 2026 at 5:40 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [High] The patch unconditionally rejects pool assignment for devices with IFF_TX_SKB_NO_LINEAR if tx_sw_csum is enabled, which incorrectly breaks native Zero-Copy (ZC) setups on those devices.
> --
>
> commit 5beba1aaed8c200424f98af3c07472ab4da8d451
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
>
> This commit rejects UMEM bindings for devices that advertise
> IFF_TX_SKB_NO_LINEAR when software checksumming (tx_sw_csum) is enabled.
> This prevents a warning and panic in skb_checksum_help() which relies on a
> linear skb head.
>
> Link: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
>
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index cd7bc50872f6b..d981cfdd85357 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -175,6 +175,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> >       if (force_zc && force_copy)
> >               return -EINVAL;
> >
> > +     if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> > +             return -EOPNOTSUPP;
> > +
>
> Does this unconditionally reject the pool assignment even for native
> Zero-Copy (ZC) setups?

It's an odd combination: nobody uses sw csum in zc mode. If someone
does use it and gets rejected at bind(), they will quickly root-cause
it and adjust the application code.

>
> In native XDP_ZEROCOPY mode, AF_XDP descriptors are sent directly to the
> hardware. No sk_buff is allocated, so skb_checksum_help() is never called,
> making the tx_sw_csum flag irrelevant and safe to ignore.
>
> By placing this check before determining if native ZC is available, does
> this break ZC mode on drivers like virtio_net for applications that pass
> the XDP_UMEM_TX_SW_CSUM flag as a fallback?
>
> Should this check be deferred until we are certain we are using generic
> copy mode, such as when force_copy is true or after native ZC setup fails?

Deferring the check is more complex. I'll keep the code as-is unless
someone insists that this corner case be handled.

Thanks,
Jason

>
> >       if (xsk_get_pool_from_qid(netdev, queue_id))
> >               return -EBUSY;
> >
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=1

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue
  2026-04-21  9:40   ` sashiko-bot
@ 2026-04-21 12:46     ` Jason Xing
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-21 12:46 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Tue, Apr 21, 2026 at 5:40 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [High] Memory leak of newly allocated skb in `xsk_build_skb_zerocopy()` on error paths for the first descriptor.
> --
>
> commit c0083470d60454acdbd801635be8f00a9b79a9eb
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: handle NULL dereference of the skb without frags issue
>
> This commit addresses a NULL pointer dereference in the error handling
> path of xsk_build_skb by explicitly checking if xs->skb is NULL. For the
> first-descriptor case, it cancels the CQ slot and tracks the invalid
> descriptor instead of dropping the skb.
>
> > In this series, the skb is already freed by kfree_skb() inside
> > xsk_build_skb_zerocopy for the first-descriptor case, so we only need
> > to do the bookkeeping: cancel the one reserved CQ slot and account for
> > the single invalid descriptor.
>
> Is this description accurate regarding the skb being freed?
>
> Looking at xsk_build_skb(), if xsk_build_skb_zerocopy() encounters an
> error and returns an error pointer, the local skb variable is set to
> NULL before jumping to the free_err label:
>
>     skb = xsk_build_skb_zerocopy(xs, desc);
>     if (IS_ERR(skb)) {
>         err = PTR_ERR(skb);
>         skb = NULL;
>         goto free_err;
>     }
>
> Since skb is NULL, the cleanup code at free_err is bypassed:
>
> free_err:
>     if (skb && !skb_shinfo(skb)->nr_frags)
>         kfree_skb(skb);
>
> Does this code leak the newly allocated skb and its sk_wmem_alloc charge
> when xsk_build_skb_zerocopy() fails on the first descriptor?
>
> I see this is addressed later in the series by commit aef1b3344bd95
> ("xsk: avoid skb leak in XDP_TX_METADATA case") and commit 93315725e1616
> ("xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS"), but
> the commit message here appears to incorrectly assume the free is already
> happening.

They are already fixed later in the series, which is why the commit
message says "In this series, ...".

>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 887abed254668..54d4895fc11eb 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
>
> [ ... ]
>
> > @@ -973,9 +973,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >               kfree_skb(skb);
> >
> >       if (err == -EOVERFLOW) {
> > -             /* Drop the packet */
> > -             xsk_inc_num_desc(xs->skb);
> > -             xsk_drop_skb(xs->skb);
> > +             if (xs->skb) {
> > +                     /* Drop the packet */
> > +                     xsk_inc_num_desc(xs->skb);
> > +                     xsk_drop_skb(xs->skb);
> > +             } else {
> > +                     xsk_cq_cancel_locked(xs->pool, 1);
> > +                     xs->tx->invalid_descs++;
> > +             }
>
> Because the skb wasn't freed as the commit message assumed, can a
> malicious application exploit this missing kfree_skb by repeatedly
> supplying abnormally large descriptors to exhaust socket buffer memory?

It can happen, but fixing it doesn't belong in this patch.

>
> >               xskq_cons_release(xs->tx);
> >       } else {
> >               /* Let application retry */
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=2

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
  2026-04-21  9:40   ` sashiko-bot
@ 2026-04-21 12:51     ` Jason Xing
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-21 12:51 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Tue, Apr 21, 2026 at 5:40 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [Critical] Dangling `xs->skb` pointer on `-EOVERFLOW` error path leading to Use-After-Free.
> --
>
> commit 4ce0cecf5e8e18bd369c02242f70a42cc8fbdf03
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
>
> This commit aims to fix a use-after-free in the xsk_build_skb() error
> handling path by ensuring we only free newly allocated skbs when a
> continuation descriptor fails. It prevents freeing an in-progress
> multi-buffer skb by adding a check for !xs->skb before kfree_skb(skb).
>
> Link: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -969,7 +969,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> [ ... ]
> >  free_err:
> > -     if (skb && !skb_shinfo(skb)->nr_frags)
> > +     if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
> >               kfree_skb(skb);
> >
> >       if (err == -EOVERFLOW) {
>
> Does this patch leave a use-after-free on the -EOVERFLOW error path?
>
> When xsk_build_skb() handles a continuation descriptor and encounters an
> -EOVERFLOW error, it jumps to free_err and enters this block. Looking at the
> subsequent code in xsk_build_skb():
>
>         if (err == -EOVERFLOW) {
>                 if (xs->skb) {
>                         /* Drop the packet */
>                         xsk_inc_num_desc(xs->skb);
>                         xsk_drop_skb(xs->skb);
>                 } else {
>
> The xsk_drop_skb(xs->skb) call consumes and frees the socket buffer, but
> xs->skb is never set to NULL.

This conclusion doesn't hold: that path already goes through
xsk_consume_skb(skb), which sets xs->skb = NULL.

>
> Because xsk_build_skb() returns ERR_PTR(-EOVERFLOW), the caller
> __xsk_generic_xmit() intercepts this error and continues its transmission
> loop without aborting.
>
> On the next iteration, __xsk_generic_xmit() would pass the dangling xs->skb
> pointer back into xsk_build_skb(), which will dereference the freed memory
> when checking skb_shinfo(skb)->nr_frags.
>
> Should xs->skb be cleared to NULL after calling xsk_drop_skb() here?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=3

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case
  2026-04-21  9:40   ` sashiko-bot
@ 2026-04-21 12:58     ` Jason Xing
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-21 12:58 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Tue, Apr 21, 2026 at 5:40 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 2 potential issue(s):
> - [Critical] Time-of-Check to Time-of-Use (TOCTOU) vulnerability in `xsk_skb_metadata` allows unprivileged users to bypass bounds checks and cause out-of-bounds memory accesses.
> - [High] Software checksumming calculation fails completely because `skb_checksum_help()` is invoked on an incomplete and non-linear SKB.
> --
>
> commit aef1b3344bd952f39b1181e046ab7d9ce77fbc8a
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: avoid skb leak in XDP_TX_METADATA case
>
> This commit explicitly adds a call to kfree_skb() (via
> xsk_drop_untrans_skb) when xsk_skb_metadata() returns an error, preventing
> an skb leak if metadata processing fails.
>
> Link: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 55b448abb8178..fd5432fb807e1 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -829,8 +829,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> >               xsk_skb_init_misc(skb, xs, desc->addr);
> >               if (desc->options & XDP_TX_METADATA) {
> >                       err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
>
> While looking at this metadata handling, I noticed a couple of pre-existing
> regressions.
>
> Will skb_checksum_help() always fail when called here?
>
> xsk_skb_metadata() is called right after allocating the SKB but before any
> payload fragments are appended via skb_fill_page_desc() further down in
> xsk_build_skb_zerocopy(). This means skb->len and skb_headlen(skb) are 0.
>
> If software checksumming is enabled, xsk_skb_metadata() calls
> skb_checksum_help(skb). Since skb_headlen(skb) is 0, won't the check
> offset >= skb_headlen(skb) in skb_checksum_help() always evaluate to true,
> triggering a warning and causing the packet to be dropped?

Patch 1 has already dealt with this issue.

>
> Also, can a user-space program modify the checksum metadata concurrently?
>
> In xsk_skb_metadata(), the metadata is read directly from the UMEM buffer,
> which is shared memory mapped into user-space:
>
> net/xdp/xsk.c:xsk_skb_metadata() {
>     ...
>         if (meta->flags & XDP_TXMD_FLAGS_CHECKSUM) {
>                 if (unlikely(meta->request.csum_start +
>                              meta->request.csum_offset +
>                              sizeof(__sum16) > desc->len))
>                         return -EINVAL;
>
>                 skb->csum_start = hr + meta->request.csum_start;
>                 skb->csum_offset = meta->request.csum_offset;
>     ...
> }
>
> The bounds check does not use READ_ONCE(). Could a user-space application
> modify meta->request.csum_start and meta->request.csum_offset between the
> check and the assignment? If so, it might bypass the bounds check and specify
> arbitrarily large offsets, causing out-of-bounds writes when the hardware
> inserts the checksum.

Yes, you're right about this pre-existing behavior, but I don't want
to expand this series right now. I may address it in a follow-up
patch.

>
> > -                     if (unlikely(err))
> > +                     if (unlikely(err)) {
> > +                             xsk_drop_untrans_skb(skb);
> >                               return ERR_PTR(err);
> > +                     }
> >               }
> >       } else {
> >               struct xsk_addrs *xsk_addr;
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
  2026-04-21  9:40   ` sashiko-bot
@ 2026-04-21 13:01     ` Jason Xing
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Xing @ 2026-04-21 13:01 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Tue, Apr 21, 2026 at 5:40 PM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [High] On 32-bit systems, xsk_drop_untrans_skb() will call kmem_cache_free() on an uninitialized destructor_arg pointer in early allocation failure paths, causing slab corruption.
> --
>
> commit cf643670755512ad76d19d6d0707e86a2a05c671
> Author: Jason Xing <kerneljasonxing@gmail.com>
>
> xsk: fix u64 descriptor address truncation on 32-bit architectures
>
> This commit fixes a truncation issue of the 64-bit descriptor address
> on 32-bit architectures by allocating an xsk_addrs struct to store the
> full address instead of packing it into the destructor_arg pointer.
> It also updates the related cleanup and initialization functions to
> handle the newly allocated struct.
>
> Link: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 9c06a4e868725..0232559d5348f 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> [ ... ]
> > @@ -799,6 +825,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> >
> >  static void xsk_drop_untrans_skb(struct sk_buff *skb)
> >  {
> > +     if (!IS_ENABLED(CONFIG_64BIT) && !xsk_skb_destructor_is_addr(skb)) {
> > +             struct xsk_addrs *xsk_addr;
> > +
> > +             xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> > +             kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
> > +     }
>
> Could this pass an uninitialized pointer to kmem_cache_free() and cause slab
> corruption?
>
> When a new socket buffer is allocated via sock_alloc_send_skb(), the
> networking core's __alloc_skb() function only zeroes the skb_shared_info
> struct up to the dataref field. The destructor_arg field resides after
> dataref and contains uninitialized garbage from the slab.
>
> If an error occurs before destructor_arg is explicitly initialized, such as
> when xsk_skb_init_misc() returns -ENOMEM:
>
> xsk_build_skb_zerocopy()
>     ...
>     skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
>     ...
>     err = xsk_skb_init_misc(skb, xs, desc->addr);
>     if (unlikely(err)) {
>         xsk_drop_untrans_skb(skb);
>         return ERR_PTR(err);
>     }
>
> In xsk_drop_untrans_skb() on 32-bit systems, xsk_skb_destructor_is_addr(skb)
> evaluates to !skb_shinfo(skb)->destructor_arg. Because destructor_arg
> contains non-zero garbage, the function returns false. The check in
> xsk_drop_untrans_skb() (!xsk_skb_destructor_is_addr(skb)) then succeeds,
> passing the garbage pointer to kmem_cache_free().

Yes, you're right. After I removed the untrans_skb code block as
suggested by Stan, the problem is gone :)

>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260420082805.14844-1-kerneljasonxing@gmail.com?part=8

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
  2026-04-20 23:51     ` Jason Xing
@ 2026-04-21 22:20       ` Stanislav Fomichev
  0 siblings, 0 replies; 31+ messages in thread
From: Stanislav Fomichev @ 2026-04-21 22:20 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf, netdev, Jason Xing

On 04/21, Jason Xing wrote:
> On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
> >
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > skb_checksum_help() is a common helper that writes the folded
> > > 16-bit checksum back via skb->data + csum_start + csum_offset,
> > > i.e. it relies on the skb's linear head and fails (with WARN_ONCE
> > > and -EINVAL) when skb_headlen() is 0.
> > >
> > > AF_XDP generic xmit takes two very different paths depending on the
> > > netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
> > > skip the "copy payload into a linear head" step on purpose as a
> > > performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
> > > pages as frags and never calls skb_put(), so skb_headlen() stays 0
> > > for the whole skb. For these skbs there is simply no linear area for
> > > skb_checksum_help() to write the csum into - the sw-csum fallback is
> > > structurally inapplicable.
> > >
> > > The patch tries to catch this and reject the combination with error at
> > > setup time. Rejecting at bind() converts this silent per-packet failure
> > > into a synchronous, actionable -EOPNOTSUPP at setup time. HW csum and
> > > launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
> > > because they do not call skb_checksum_help().
> > >
> > > Without the patch, every descriptor carrying 'XDP_TX_METADATA |
> > > XDP_TXMD_FLAGS_CHECKSUM' produces:
> > > 1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
> > > 2) sendmsg() returning -EINVAL without consuming the descriptor
> > >    (invalid_descs is not incremented),
> > > 3) a wedged TX ring: __xsk_generic_xmit() does not advance the
> > >     consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
> > >     the same descriptor and re-hits the same WARN until the socket
> > >     is closed.
> > >
> > > Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
> > > Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > >  net/xdp/xsk_buff_pool.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > index 37b7a68b89b3..c2521b6547e3 100644
> > > --- a/net/xdp/xsk_buff_pool.c
> > > +++ b/net/xdp/xsk_buff_pool.c
> > > @@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > >       if (force_zc && force_copy)
> > >               return -EINVAL;
> > >
> > > +     if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> > > +             return -EOPNOTSUPP;
> > > +
> > >       if (xsk_get_pool_from_qid(netdev, queue_id))
> > >               return -EBUSY;
> > >
> > > --
> > > 2.41.3
> > >
> >
> > Wondering whether a better fixes tag is commit 11614723af26 ("xsk: Add option
> > to calculate TX checksum in SW")?
> >
> > Acked-by: Stanislav Fomichev <sdf@fomichev.me>
> 
Thanks for checking, but not really: it is commit 30c3055f9c0d that
brought csum support to the IFF_TX_SKB_NO_LINEAR case where this issue
can be triggered (because that mode no longer puts data into the skb
linear area).

Ah, right, good point, it's the IFF_TX_SKB_NO_LINEAR path.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
  2026-04-21  0:49     ` Jason Xing
@ 2026-04-21 22:23       ` Stanislav Fomichev
  0 siblings, 0 replies; 31+ messages in thread
From: Stanislav Fomichev @ 2026-04-21 22:23 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, bpf, netdev, Jason Xing

On 04/21, Jason Xing wrote:
> On Tue, Apr 21, 2026 at 3:49 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
> >
> > On 04/20, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
> > > descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
> > > uintptr_t cast:
> > >
> > >     skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
> > >
> > > On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
> > > the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
> > > mode the chunk offset is encoded in bits 48-63 of the descriptor
> > > address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
> > > lost entirely. The completion queue then returns a truncated address to
> > > userspace, making buffer recycling impossible.
> > >
> > > Fix this by handling the 32-bit case directly in
> > > xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an xsk_addrs
> > > struct (the same path already used for multi-descriptor SKBs) to store
> > > the full u64 address.
> >
> > Is it easier to make XSK `depends on 64BIT` to avoid dealing with that? Does
> 
Of course, that would be much easier; my initial version actually did
exactly that. One line of code is enough.
> 
> > anybody seriously run af_xdp on 32 bit systems?
> 
But my worry, as you guessed, is whether any 32-bit users actually
exist. I doubt it. That's why I put some effort into adding
compatibility code to cover the case. The good news is that it has no
side effects on 64-bit systems, since the new code is guarded by the
IS_ENABLED() condition.

If someone complains, we can follow up. AF_XDP is u64 everywhere,
including the uapi; I doubt anyone is using it on 32-bit systems. We
don't test this part on 32-bit systems in NIPA either.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2026-04-21 22:23 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-20  8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-04-20  8:27 ` [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
2026-04-20 19:34   ` Stanislav Fomichev
2026-04-20 23:51     ` Jason Xing
2026-04-21 22:20       ` Stanislav Fomichev
2026-04-21  9:40   ` sashiko-bot
2026-04-21 12:39     ` Jason Xing
2026-04-20  8:27 ` [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
2026-04-20 19:34   ` Stanislav Fomichev
2026-04-21  9:40   ` sashiko-bot
2026-04-21 12:46     ` Jason Xing
2026-04-20  8:28 ` [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
2026-04-20 19:34   ` Stanislav Fomichev
2026-04-21  0:01     ` Jason Xing
2026-04-21  9:40   ` sashiko-bot
2026-04-21 12:51     ` Jason Xing
2026-04-20  8:28 ` [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
2026-04-20 19:34   ` Stanislav Fomichev
2026-04-21  0:51     ` Jason Xing
2026-04-20  8:28 ` [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
2026-04-21  9:40   ` sashiko-bot
2026-04-21 12:58     ` Jason Xing
2026-04-20  8:28 ` [PATCH net v2 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
2026-04-20  8:28 ` [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
2026-04-20 19:58   ` Stanislav Fomichev
2026-04-20  8:28 ` [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
2026-04-20 19:49   ` Stanislav Fomichev
2026-04-21  0:49     ` Jason Xing
2026-04-21 22:23       ` Stanislav Fomichev
2026-04-21  9:40   ` sashiko-bot
2026-04-21 13:01     ` Jason Xing

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox