public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation
@ 2026-04-24  5:38 Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Jason Xing @ 2026-04-24  5:38 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

There are several rare bugs around xsk_build_skb(). Some of them
were found by Sashiko[1][2].

[1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
[2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/

---
v4
Link: https://lore.kernel.org/all/20260422033650.68457-1-kerneljasonxing@gmail.com/
1. fix 32-bit arch issue in patch 8 (Alexander && Maciej)
2. add acked-by tags (Stan)

v3
Link: https://lore.kernel.org/all/20260420082805.14844-1-kerneljasonxing@gmail.com/
1. use !xs->skb as the indicator of first frag in patch 3 (Stan)
2. move init skb after it's allocated successfully, so patch 4,5,6 have
   been adjusted accordingly (Stan)
3. do not support 32-bit arch (Stan)
4. add acked-by tags (Stan)

v2
Link: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/#t
1. add four patches to fix pre-existing buggy behavior spotted by
   Sashiko
2. adjust the order of 8 patches.

Jason Xing (8):
  xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
  xsk: handle NULL dereference of the skb without frags issue
  xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
  xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
  xsk: avoid skb leak in XDP_TX_METADATA case
  xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
  xsk: fix xsk_addrs slab leak on multi-buffer error path
  xsk: fix u64 descriptor address truncation on 32-bit architectures

 net/xdp/xsk.c           | 65 +++++++++++++++++++++++++++++++----------
 net/xdp/xsk_buff_pool.c |  3 ++
 2 files changed, 52 insertions(+), 16 deletions(-)

-- 
2.41.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net v4 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
  2026-04-24  5:38 [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
@ 2026-04-24  5:38 ` Jason Xing
       [not found]   ` <20260425054032.28A27C2BCB2@smtp.kernel.org>
  2026-04-24  5:38 ` [PATCH net v4 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Jason Xing @ 2026-04-24  5:38 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

skb_checksum_help() is a common helper that writes the folded
16-bit checksum back via skb->data + csum_start + csum_offset,
i.e. it relies on the skb's linear head and fails (with WARN_ONCE
and -EINVAL) when skb_headlen() is 0.

AF_XDP generic xmit takes two very different paths depending on the
netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
skip the "copy payload into a linear head" step on purpose as a
performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
pages as frags and never calls skb_put(), so skb_headlen() stays 0
for the whole skb. For these skbs there is simply no linear area for
skb_checksum_help() to write the csum into - the sw-csum fallback is
structurally inapplicable.

Catch this combination and reject it at bind() time, converting the
silent per-packet failure into a synchronous, actionable -EOPNOTSUPP
at setup time. HW csum and launch_time metadata on
IFF_TX_SKB_NO_LINEAR drivers are unaffected because they do not call
skb_checksum_help().

Without the patch, every descriptor carrying 'XDP_TX_METADATA |
XDP_TXMD_FLAGS_CHECKSUM' produces:
1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
2) sendmsg() returning -EINVAL without consuming the descriptor
   (invalid_descs is not incremented),
3) a wedged TX ring: __xsk_generic_xmit() does not advance the
   consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
   the same descriptor and re-hits the same WARN until the socket
   is closed.

Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk_buff_pool.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 37b7a68b89b3..c2521b6547e3 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (force_zc && force_copy)
 		return -EINVAL;
 
+	if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
+		return -EOPNOTSUPP;
+
 	if (xsk_get_pool_from_qid(netdev, queue_id))
 		return -EBUSY;
 
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net v4 2/8] xsk: handle NULL dereference of the skb without frags issue
  2026-04-24  5:38 [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
@ 2026-04-24  5:38 ` Jason Xing
  2026-04-28 11:33   ` Simon Horman
  2026-04-24  5:38 ` [PATCH net v4 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Jason Xing @ 2026-04-24  5:38 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When a first descriptor (xs->skb == NULL) triggers -EOVERFLOW in
xsk_build_skb_zerocopy (e.g., MAX_SKB_FRAGS exceeded), the free_err
EOVERFLOW handler unconditionally dereferences xs->skb via
xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb), causing a NULL
pointer dereference.

In this series, the skb is already freed by kfree_skb() inside
xsk_build_skb_zerocopy for the first-descriptor case, so we only need
to do the bookkeeping: cancel the one reserved CQ slot and account for
the single invalid descriptor.

Guard the existing xsk_inc_num_desc/xsk_drop_skb calls with an
xs->skb check (for the continuation case), and add an else branch
for the first-descriptor case that manually cancels the CQ slot and
increments invalid_descs by one.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6149f6a79897..6521604f8d42 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -893,9 +893,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		kfree_skb(skb);
 
 	if (err == -EOVERFLOW) {
-		/* Drop the packet */
-		xsk_inc_num_desc(xs->skb);
-		xsk_drop_skb(xs->skb);
+		if (xs->skb) {
+			/* Drop the packet */
+			xsk_inc_num_desc(xs->skb);
+			xsk_drop_skb(xs->skb);
+		} else {
+			xsk_cq_cancel_locked(xs->pool, 1);
+			xs->tx->invalid_descs++;
+		}
 		xskq_cons_release(xs->tx);
 	} else {
 		/* Let application retry */
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net v4 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
  2026-04-24  5:38 [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
@ 2026-04-24  5:38 ` Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-04-24  5:38 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When xsk_build_skb() processes multi-buffer packets in copy mode, the
first descriptor stores data into the skb linear area without adding
any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
to accumulate subsequent descriptors.

If a continuation descriptor fails (e.g. alloc_page returns NULL with
-EAGAIN), we jump to free_err where the condition:

  if (skb && !skb_shinfo(skb)->nr_frags)
      kfree_skb(skb);

evaluates to true because nr_frags is still 0 (the first descriptor
used the linear area, not frags). This frees the skb while xs->skb
still points to it, creating a dangling pointer. On the next transmit
attempt or socket close, xs->skb is dereferenced, causing a
use-after-free or double-free.

Fix by using a !xs->skb check to handle the first-frag case, ensuring
we only free skbs that were freshly allocated in this call (xs->skb is
NULL) and never free an in-progress multi-buffer skb that the caller
still references.

Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6521604f8d42..d23d1b14b8b4 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -889,7 +889,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 	return skb;
 
 free_err:
-	if (skb && !skb_shinfo(skb)->nr_frags)
+	if (skb && !xs->skb)
 		kfree_skb(skb);
 
 	if (err == -EOVERFLOW) {
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net v4 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
  2026-04-24  5:38 [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (2 preceding siblings ...)
  2026-04-24  5:38 ` [PATCH net v4 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
@ 2026-04-24  5:38 ` Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 5/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-04-24  5:38 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Once xsk_skb_init_misc() has been called on an skb, its destructor is
set to xsk_destruct_skb(), which submits the descriptor address(es) to
the completion queue and advances the CQ producer. If such an skb is
subsequently freed via kfree_skb() along an error path - before the
skb has ever been handed to the driver - the destructor still runs and
submits a bogus, half-initialized address to the CQ.

Postpone the init until the allocation of the first frag has completed
successfully. Before this init runs, the skb can be safely freed by
kfree_skb().

Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index d23d1b14b8b4..88ec6d2cbbcf 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -739,8 +739,6 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 			return ERR_PTR(err);
 
 		skb_reserve(skb, hr);
-
-		xsk_skb_init_misc(skb, xs, desc->addr);
 		if (desc->options & XDP_TX_METADATA) {
 			err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
 			if (unlikely(err))
@@ -834,7 +832,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			if (unlikely(err))
 				goto free_err;
 
-			xsk_skb_init_misc(skb, xs, desc->addr);
 			if (desc->options & XDP_TX_METADATA) {
 				err = xsk_skb_metadata(skb, buffer, desc,
 						       xs->pool, hr);
@@ -884,6 +881,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		}
 	}
 
+	if (!xs->skb)
+		xsk_skb_init_misc(skb, xs, desc->addr);
 	xsk_inc_num_desc(skb);
 
 	return skb;
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net v4 5/8] xsk: avoid skb leak in XDP_TX_METADATA case
  2026-04-24  5:38 [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (3 preceding siblings ...)
  2026-04-24  5:38 ` [PATCH net v4 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
@ 2026-04-24  5:38 ` Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-04-24  5:38 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Fix an skb leak by explicitly calling kfree_skb() before returning to
the caller.

How to reproduce it in virtio_net:
1. the current skb is the first one (which means no frag and xs->skb is
   NULL) and users enable the metadata feature.
2. xsk_skb_metadata() returns an error code.
3. the caller xsk_build_skb() clears skb by using 'skb = NULL;'.
4. there is no chance to free this skb anymore.

Closes: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 88ec6d2cbbcf..c49b58199d2f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -741,8 +741,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 		skb_reserve(skb, hr);
 		if (desc->options & XDP_TX_METADATA) {
 			err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
-			if (unlikely(err))
+			if (unlikely(err)) {
+				kfree_skb(skb);
 				return ERR_PTR(err);
+			}
 		}
 	} else {
 		struct xsk_addrs *xsk_addr;
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net v4 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
  2026-04-24  5:38 [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (4 preceding siblings ...)
  2026-04-24  5:38 ` [PATCH net v4 5/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
@ 2026-04-24  5:38 ` Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
  7 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-04-24  5:38 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Fix an skb leak by explicitly calling kfree_skb() before returning to
the caller.

How to reproduce it in virtio_net:
1. the current skb is the first one (which means xs->skb is NULL) and
   hits the MAX_SKB_FRAGS limit.
2. xsk_build_skb_zerocopy() returns -EOVERFLOW.
3. the caller xsk_build_skb() clears skb by using 'skb = NULL;'. This
   is why the bug can be triggered.
4. there is no chance to free this skb anymore.

Note that if in this case the xs->skb is not NULL, xsk_build_skb() will
call xsk_drop_skb(xs->skb) to do the right thing.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index c49b58199d2f..5e6326e076ab 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -776,8 +776,11 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 	addr = buffer - pool->addrs;
 
 	for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) {
-		if (unlikely(i >= MAX_SKB_FRAGS))
+		if (unlikely(i >= MAX_SKB_FRAGS)) {
+			if (!xs->skb)
+				kfree_skb(skb);
 			return ERR_PTR(-EOVERFLOW);
+		}
 
 		page = pool->umem->pgs[addr >> PAGE_SHIFT];
 		get_page(page);
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net v4 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path
  2026-04-24  5:38 [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (5 preceding siblings ...)
  2026-04-24  5:38 ` [PATCH net v4 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
@ 2026-04-24  5:38 ` Jason Xing
  2026-04-24  5:38 ` [PATCH net v4 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
  7 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-04-24  5:38 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When xsk_build_skb() / xsk_build_skb_zerocopy() sees the first
continuation descriptor, it promotes destructor_arg from an inlined
address to a freshly allocated xsk_addrs (num_descs = 1). The counter
is bumped to >= 2 only at the very end of a successful build (by calling
xsk_inc_num_desc()).

If the build fails in between (e.g. alloc_page() returns NULL with
-EAGAIN, or the MAX_SKB_FRAGS overflow hits), we jump to free_err, skip
calling xsk_inc_num_desc() to increment num_descs and leave the half-built
skb attached to xs->skb for the app to retry. The skb now has
1) destructor_arg = a real xsk_addrs pointer,
2) num_descs = 1

If the app never retries and just close()s the socket, xsk_release()
calls xsk_drop_skb() -> xsk_consume_skb(), which decides whether to
free xsk_addrs by testing num_descs > 1:

    if (unlikely(num_descs > 1))
        kmem_cache_free(xsk_tx_generic_cache, destructor_arg);

Because num_descs is exactly 1, the branch is skipped and the
xsk_addrs object allocated from the xsk_tx_generic_cache slab is
leaked.

Fix it by directly testing whether destructor_arg still holds an
inlined address. If not, it has been repointed to memory newly
allocated from xsk_tx_generic_cache, which must be freed regardless of
whether num_descs was ever incremented.

Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5e6326e076ab..ed96f6ec8ff2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -605,7 +605,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
 	spin_lock_irqsave(&pool->cq_prod_lock, flags);
 	idx = xskq_get_prod(pool->cq);
 
-	if (unlikely(num_descs > 1)) {
+	if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
 		xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
 
 		for (i = 0; i < num_descs; i++) {
@@ -660,7 +660,7 @@ static void xsk_consume_skb(struct sk_buff *skb)
 	u32 num_descs = xsk_get_num_desc(skb);
 	struct xsk_addrs *xsk_addr;
 
-	if (unlikely(num_descs > 1)) {
+	if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
 		xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
 		kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
 	}
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net v4 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
  2026-04-24  5:38 [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
                   ` (6 preceding siblings ...)
  2026-04-24  5:38 ` [PATCH net v4 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
@ 2026-04-24  5:38 ` Jason Xing
  2026-04-28 13:18   ` Paolo Abeni
  7 siblings, 1 reply; 12+ messages in thread
From: Jason Xing @ 2026-04-24  5:38 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
uintptr_t cast:

    skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);

On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
the descriptor address are silently dropped. In unaligned mode the chunk
offset is encoded in bits 48-63 of the descriptor address
(XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is lost
entirely. The completion queue then returns a truncated address to
userspace, making buffer recycling impossible.

Fix this by handling the 32-bit case in the destructor_arg helpers:

- xsk_skb_destructor_set_addr(): on !CONFIG_64BIT, allocate an
  xsk_addrs struct via kmem_cache_zalloc() to store the full u64
  address. Leave num_descs as 0 (zalloc) so that the subsequent
  xsk_inc_num_desc() brings it to the correct count of 1.

- xsk_skb_destructor_is_addr(): on !CONFIG_64BIT, return true only
  when destructor_arg is NULL (not yet set), false when it points to
  an xsk_addrs struct.

- xsk_skb_init_misc(): call xsk_skb_destructor_set_addr() first
  before touching any other skb fields; on failure return early so
  the skb destructor is never changed from sock_wfree.

The existing xsk_consume_skb() already handles 32-bit correctly after
these changes: xsk_skb_destructor_is_addr() returns false for any
allocated xsk_addrs, so the kmem_cache_free path is always taken.

The overhead is one extra kmem_cache_zalloc per first descriptor on
32-bit only; 64-bit builds are completely unchanged.

Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 38 +++++++++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ed96f6ec8ff2..fe88f47741b5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -558,7 +558,10 @@ static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool)
 
 static bool xsk_skb_destructor_is_addr(struct sk_buff *skb)
 {
-	return (uintptr_t)skb_shinfo(skb)->destructor_arg & 0x1UL;
+	if (IS_ENABLED(CONFIG_64BIT))
+		return (uintptr_t)skb_shinfo(skb)->destructor_arg & 0x1UL;
+	else
+		return !skb_shinfo(skb)->destructor_arg;
 }
 
 static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
@@ -566,9 +569,21 @@ static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
 	return (u64)((uintptr_t)skb_shinfo(skb)->destructor_arg & ~0x1UL);
 }
 
-static void xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
+static int xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
 {
+	if (!IS_ENABLED(CONFIG_64BIT)) {
+		struct xsk_addrs *xsk_addr;
+
+		xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
+		if (!xsk_addr)
+			return -ENOMEM;
+		xsk_addr->addrs[0] = addr;
+		skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
+		return 0;
+	}
+
 	skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
+	return 0;
 }
 
 static void xsk_inc_num_desc(struct sk_buff *skb)
@@ -644,14 +659,20 @@ void xsk_destruct_skb(struct sk_buff *skb)
 	sock_wfree(skb);
 }
 
-static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
-			      u64 addr)
+static int xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
+			     u64 addr)
 {
+	int err;
+
+	err = xsk_skb_destructor_set_addr(skb, addr);
+	if (err)
+		return err;
+
 	skb->dev = xs->dev;
 	skb->priority = READ_ONCE(xs->sk.sk_priority);
 	skb->mark = READ_ONCE(xs->sk.sk_mark);
 	skb->destructor = xsk_destruct_skb;
-	xsk_skb_destructor_set_addr(skb, addr);
+	return 0;
 }
 
 static void xsk_consume_skb(struct sk_buff *skb)
@@ -886,8 +907,11 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		}
 	}
 
-	if (!xs->skb)
-		xsk_skb_init_misc(skb, xs, desc->addr);
+	if (!xs->skb) {
+		err = xsk_skb_init_misc(skb, xs, desc->addr);
+		if (unlikely(err))
+			goto free_err;
+	}
 	xsk_inc_num_desc(skb);
 
 	return skb;
-- 
2.41.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net v4 2/8] xsk: handle NULL dereference of the skb without frags issue
  2026-04-24  5:38 ` [PATCH net v4 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
@ 2026-04-28 11:33   ` Simon Horman
  0 siblings, 0 replies; 12+ messages in thread
From: Simon Horman @ 2026-04-28 11:33 UTC (permalink / raw)
  To: kerneljasonxing
  Cc: 'Simon Horman', davem, edumazet, kuba, pabeni, bjorn,
	magnus.karlsson, maciej.fijalkowski, jonathan.lemon, sdf, ast,
	daniel, hawk, john.fastabend, aleksander.lobakin, bpf, netdev,
	kernelxing

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/

Simon says:

Hi Jason,

I'm providing this for informational purposes.
My understanding is that the problem flagged below
is not present with the full 8 patches of the series
applied.

---
xsk: handle NULL dereference of the skb without frags issue

This change guards the existing xsk_inc_num_desc()/xsk_drop_skb() calls
in xsk_build_skb()'s free_err path with an xs->skb check, and adds a new
else branch that cancels the reserved CQ slot and bumps invalid_descs
when xs->skb is NULL.

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 887abed25466..54d4895fc11e 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -973,9 +973,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  		kfree_skb(skb);
>  
>  	if (err == -EOVERFLOW) {
> -		/* Drop the packet */
> -		xsk_inc_num_desc(xs->skb);
> -		xsk_drop_skb(xs->skb);
> +		if (xs->skb) {
> +			/* Drop the packet */
> +			xsk_inc_num_desc(xs->skb);
> +			xsk_drop_skb(xs->skb);
> +		} else {
> +			xsk_cq_cancel_locked(xs->pool, 1);
> +			xs->tx->invalid_descs++;
> +		}

The changelog states:

    In this series, the skb is already freed by kfree_skb() inside
    xsk_build_skb_zerocopy for the first-descriptor case, so we only need
    to do the bookkeeping: cancel the one reserved CQ slot and account
    for the single invalid descriptor.

Is that accurate for this commit?

Looking at xsk_build_skb_zerocopy() as it stands here, neither of its
error returns actually frees the skb:

    if (!skb) {
            ...
            skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
            if (unlikely(!skb))
                    return ERR_PTR(err);

            skb_reserve(skb, hr);

            xsk_skb_init_misc(skb, xs, desc->addr);
            if (desc->options & XDP_TX_METADATA) {
                    err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
                    if (unlikely(err))
                            return ERR_PTR(err);
            }
    }
    ...
    for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) {
            if (unlikely(i >= MAX_SKB_FRAGS))
                    return ERR_PTR(-EOVERFLOW);

            page = pool->umem->pgs[addr >> PAGE_SHIFT];
            get_page(page);
            ...
            skb_fill_page_desc(skb, i, page, offset, copy);
            ...
    }

So on the first-descriptor -EOVERFLOW path (xs->skb == NULL,
IFF_TX_SKB_NO_LINEAR device), doesn't xsk_build_skb_zerocopy() return
with the skb still allocated, up to MAX_SKB_FRAGS pages pinned via
get_page() and attached through skb_fill_page_desc(), and an
sk_wmem_alloc charge held by xsk_destruct_skb (installed by
xsk_skb_init_misc)?

Control then returns to xsk_build_skb():

    skb = xsk_build_skb_zerocopy(xs, desc);
    if (IS_ERR(skb)) {
            err = PTR_ERR(skb);
            skb = NULL;
            goto free_err;
    }
    ...
free_err:
    if (skb && !skb_shinfo(skb)->nr_frags)
            kfree_skb(skb);

Since the local skb has been set to NULL, the kfree_skb() guard is
skipped, and the new else-branch only performs the CQ cancel and
invalid_descs bump.

Doesn't that leak the sk_buff, the attached frag pages, and the
sk_wmem_alloc charge every time a first descriptor hits the i >=
MAX_SKB_FRAGS condition (or the xsk_skb_metadata failure)?

A later patch in the series, "xsk: free the skb when hitting the upper
bound MAX_SKB_FRAGS", adds an "if (!xs->skb) kfree_skb(skb);" inside
xsk_build_skb_zerocopy() before the -EOVERFLOW return, which seems to
confirm the skb is not freed at this commit. Should the changelog's
premise be reworked, or the free added here so each patch in the series
stands on its own?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v4 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
       [not found]   ` <20260425054032.28A27C2BCB2@smtp.kernel.org>
@ 2026-04-28 13:13     ` Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2026-04-28 13:13 UTC (permalink / raw)
  To: Jason Xing; +Cc: bpf, netdev@vger.kernel.org, sashiko

On 4/25/26 7:40 AM, sashiko-bot@kernel.org wrote:
> Sashiko AI review found 1 potential issue(s):
> - [High] The patch incompletely fixes the WARN_ON_ONCE in
>   skb_checksum_help() triggered by pure-fragment SKBs. It successfully
>   prevents the explicit fallback in xsk_skb_metadata() (which wedged
>   the TX ring), but leaves the implicit generic network stack fallback
>   vulnerable. If a user sets XDP_TXMD_FLAGS_CHECKSUM on an
>   IFF_TX_SKB_NO_LINEAR device and the hardware checksum offload is
>   disabled at runtime (e.g., via `ethtool -K eth0 tx off`), the core
>   stack's validate_xmit_skb() takes over. Because NETIF_F_SG is
>   allowed to remain active even without checksum features,
>   skb_needs_linearize() evaluates to false, leaving the SKB
>   unlinearized. The SKB is then passed to skb_csum_hwoffload_help() ->
>   skb_checksum_help(), which hits the exact same
>   WARN_ON_ONCE(offset >= skb_headlen(skb)) because skb_headlen is 0.
> --
> 
> commit 86ab0a4900aa60bee0c3df7393a2846b5ff74ccb
> Author: Jason Xing <kerneljasonxing@gmail.com>
> 
> xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
> 
> This commit attempts to fix a warning triggered in skb_checksum_help()
> by AF_XDP pure-fragment SKBs. It rejects sw-csum UMEM binding for
> drivers advertising IFF_TX_SKB_NO_LINEAR at setup time.
> 
> Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
> Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> 
>> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
>> index cd7bc50872f6b5..d981cfdd853578 100644
>> --- a/net/xdp/xsk_buff_pool.c
>> +++ b/net/xdp/xsk_buff_pool.c
>> @@ -175,6 +175,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>>  	if (force_zc && force_copy)
>>  		return -EINVAL;
>>  
>> +	if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
>> +		return -EOPNOTSUPP;
>> +
> 
> Does this still leave a path for the warning in skb_checksum_help() to
> trigger if hardware checksum offloading is disabled at runtime?
> 
> If a user sets XDP_TXMD_FLAGS_CHECKSUM on an IFF_TX_SKB_NO_LINEAR device,
> and hardware checksum offload is later disabled (e.g., via ethtool -K
> eth0 tx off), the core stack's validate_xmit_skb() will take over.
> 
> Because NETIF_F_SG can remain active even without checksum features,
> skb_needs_linearize() evaluates to false, leaving the SKB unlinearized.
> 
> The SKB is then passed to skb_csum_hwoffload_help() which falls back to
> skb_checksum_help(). Since the pure-fragment SKB still has a headlen of 0,
> won't this hit the exact same WARN_ON_ONCE(offset >= skb_headlen(skb))
> warning?

My take here is that the issue described above should be handled
separately. (And off the top of my head I don't see an easy way to
address it short of an additional check in the tx path, which I think we
want to avoid.)

/P


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v4 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
  2026-04-24  5:38 ` [PATCH net v4 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
@ 2026-04-28 13:18   ` Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2026-04-28 13:18 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, aleksander.lobakin
  Cc: bpf, netdev, Jason Xing

On 4/24/26 7:38 AM, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
> descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
> uintptr_t cast:
> 
>     skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
> 
> On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
> the descriptor address are silently dropped. In unaligned mode the chunk
> offset is encoded in bits 48-63 of the descriptor address
> (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is lost
> entirely. The completion queue then returns a truncated address to
> userspace, making buffer recycling impossible.
> 
> Fix this by handling the 32-bit case in the destructor_arg helpers:
> 
> - xsk_skb_destructor_set_addr(): on !CONFIG_64BIT, allocate an
>   xsk_addrs struct via kmem_cache_zalloc() to store the full u64
>   address. Leave num_descs as 0 (zalloc) so that the subsequent
>   xsk_inc_num_desc() brings it to the correct count of 1.
> 
> - xsk_skb_destructor_is_addr(): on !CONFIG_64BIT, return true only
>   when destructor_arg is NULL (not yet set), false when it points to
>   an xsk_addrs struct.
> 
> - xsk_skb_init_misc(): call xsk_skb_destructor_set_addr() first
>   before touching any other skb fields; on failure return early so
>   the skb destructor is never changed from sock_wfree.
> 
> The existing xsk_consume_skb() already handles 32-bit correctly after
> these changes: xsk_skb_destructor_is_addr() returns false for any
> allocated xsk_addrs, so the kmem_cache_free path is always taken.
> 
> The overhead is one extra kmem_cache_zalloc per first descriptor on
> 32-bit only; 64-bit builds are completely unchanged.
> 
> Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
> Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>

LGTM, but waiting a bit more for Magnus, Maciej or Stan's ack.

/P


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-04-28 13:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-24  5:38 [PATCH net v4 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-04-24  5:38 ` [PATCH net v4 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
     [not found]   ` <20260425054032.28A27C2BCB2@smtp.kernel.org>
2026-04-28 13:13     ` Paolo Abeni
2026-04-24  5:38 ` [PATCH net v4 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
2026-04-28 11:33   ` Simon Horman
2026-04-24  5:38 ` [PATCH net v4 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
2026-04-24  5:38 ` [PATCH net v4 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
2026-04-24  5:38 ` [PATCH net v4 5/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
2026-04-24  5:38 ` [PATCH net v4 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
2026-04-24  5:38 ` [PATCH net v4 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
2026-04-24  5:38 ` [PATCH net v4 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
2026-04-28 13:18   ` Paolo Abeni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox