* [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation
@ 2026-05-02 20:07 Jason Xing
2026-05-02 20:07 ` [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
` (8 more replies)
0 siblings, 9 replies; 12+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
There are rare issues around xsk_build_skb(). Some of them
were found by Sashiko[1][2].
[1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
[2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/
---
v5
Link: https://lore.kernel.org/all/20260424053816.27965-1-kerneljasonxing@gmail.com/
1. reword patch 3
2. adjust the order: former patch 6 is now moved up to be patch 2
3. use helper in patch (Stan)
v4
Link: https://lore.kernel.org/all/20260422033650.68457-1-kerneljasonxing@gmail.com/
1. fix 32-bit arch issue in patch 8 (Alexander && Maciej)
2. add acked-by tags (Stan)
v3
Link: https://lore.kernel.org/all/20260420082805.14844-1-kerneljasonxing@gmail.com/
1. use !xs->skb as the indicator of first frag in patch 3 (Stan)
2. move init skb after it's allocated successfully, so patch 4,5,6 have
been adjusted accordingly (Stan)
3. do not support 32-bit arch (Stan)
4. add acked-by tags (Stan)
v2
Link: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/#t
1. add four patches spotted by Sashiko to fix buggy pre-existing
behavior
2. adjust the order of the 8 patches.
Jason Xing (8):
xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
xsk: handle NULL dereference of the skb without frags issue
xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
xsk: avoid skb leak in XDP_TX_METADATA case
xsk: fix xsk_addrs slab leak on multi-buffer error path
xsk: fix u64 descriptor address truncation on 32-bit architectures
net/xdp/xsk.c | 115 ++++++++++++++++++++++++++--------------
net/xdp/xsk_buff_pool.c | 3 ++
2 files changed, 77 insertions(+), 41 deletions(-)
--
2.41.3
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
` (7 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
skb_checksum_help() is a common helper that writes the folded
16-bit checksum back via skb->data + csum_start + csum_offset,
i.e. it relies on the skb's linear head and fails (with WARN_ONCE
and -EINVAL) when skb_headlen() is 0.
AF_XDP generic xmit takes two very different paths depending on the
netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
skip the "copy payload into a linear head" step on purpose as a
performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
pages as frags and never calls skb_put(), so skb_headlen() stays 0
for the whole skb. For these skbs there is simply no linear area for
skb_checksum_help() to write the csum into - the sw-csum fallback is
structurally inapplicable.
Catch this combination and reject it at setup time: rejecting at
bind() converts the silent per-packet failure into a synchronous,
actionable -EOPNOTSUPP. HW csum and launch_time metadata on
IFF_TX_SKB_NO_LINEAR drivers are unaffected because they do not call
skb_checksum_help().
Without the patch, every descriptor carrying 'XDP_TX_METADATA |
XDP_TXMD_FLAGS_CHECKSUM' produces:
1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
2) sendmsg() returning -EINVAL without consuming the descriptor
(invalid_descs is not incremented),
3) a wedged TX ring: __xsk_generic_xmit() does not advance the
consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
the same descriptor and re-hits the same WARN until the socket
is closed.
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk_buff_pool.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 37b7a68b89b3..c2521b6547e3 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
if (force_zc && force_copy)
return -EINVAL;
+ if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
+ return -EOPNOTSUPP;
+
if (xsk_get_pool_from_qid(netdev, queue_id))
return -EBUSY;
--
2.41.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-05-02 20:07 ` [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
` (6 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
The skb freshly allocated in xsk_build_skb_zerocopy() is leaked when
the frag count hits the MAX_SKB_FRAGS limit on the first descriptor.
Fix it by explicitly calling kfree_skb() before returning to the
caller.
How to reproduce it in virtio_net:
1. the current skb is the first one (which means xs->skb is NULL) and
hits the MAX_SKB_FRAGS limit.
2. xsk_build_skb_zerocopy() returns -EOVERFLOW.
3. the caller xsk_build_skb() clears the skb pointer with 'skb = NULL;',
so the only reference to the skb is lost.
4. there is no chance to free this skb anymore.
Note that if in this case the xs->skb is not NULL, xsk_build_skb() will
call xsk_drop_skb(xs->skb) to do the right thing.
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6149f6a79897..46660943c81e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -776,8 +776,11 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
addr = buffer - pool->addrs;
for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) {
- if (unlikely(i >= MAX_SKB_FRAGS))
+ if (unlikely(i >= MAX_SKB_FRAGS)) {
+ if (!xs->skb)
+ kfree_skb(skb);
return ERR_PTR(-EOVERFLOW);
+ }
page = pool->umem->pgs[addr >> PAGE_SHIFT];
get_page(page);
--
2.41.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-05-02 20:07 ` [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
2026-05-02 20:07 ` [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
` (5 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
When a first descriptor (xs->skb == NULL) triggers -EOVERFLOW in
xsk_build_skb_zerocopy() (e.g., MAX_SKB_FRAGS exceeded), the
free_err -EOVERFLOW handler unconditionally dereferences xs->skb
via xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb), causing
a NULL pointer dereference.
Fix this by guarding the existing xsk_inc_num_desc()/xsk_drop_skb()
calls with an xs->skb check (for the continuation case), and add
an else branch for the first-descriptor case that manually cancels
the one reserved CQ slot and increments invalid_descs by one to
account for the single invalid descriptor.
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 46660943c81e..ff1eade29aa6 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -896,9 +896,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
kfree_skb(skb);
if (err == -EOVERFLOW) {
- /* Drop the packet */
- xsk_inc_num_desc(xs->skb);
- xsk_drop_skb(xs->skb);
+ if (xs->skb) {
+ /* Drop the packet */
+ xsk_inc_num_desc(xs->skb);
+ xsk_drop_skb(xs->skb);
+ } else {
+ xsk_cq_cancel_locked(xs->pool, 1);
+ xs->tx->invalid_descs++;
+ }
xskq_cons_release(xs->tx);
} else {
/* Let application retry */
--
2.41.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (2 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
` (4 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
When xsk_build_skb() processes multi-buffer packets in copy mode, the
first descriptor stores data into the skb linear area without adding
any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
to accumulate subsequent descriptors.
If a continuation descriptor fails (e.g. alloc_page returns NULL with
-EAGAIN), we jump to free_err where the condition:
if (skb && !skb_shinfo(skb)->nr_frags)
kfree_skb(skb);
evaluates to true because nr_frags is still 0 (the first descriptor
used the linear area, not frags). This frees the skb while xs->skb
still points to it, creating a dangling pointer. On the next transmit
attempt or socket close, xs->skb is dereferenced, causing a
use-after-free or double-free.
Fix this by using a !xs->skb check to detect the first-frag situation,
ensuring we only free skbs that were freshly allocated in this call
(xs->skb is NULL) and never free an in-progress multi-buffer skb that
the caller still references.
Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ff1eade29aa6..ae59d1c1d2f8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -892,7 +892,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
return skb;
free_err:
- if (skb && !skb_shinfo(skb)->nr_frags)
+ if (skb && !xs->skb)
kfree_skb(skb);
if (err == -EOVERFLOW) {
--
2.41.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (3 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
` (3 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
Once xsk_skb_init_misc() has been called on an skb, its destructor is
set to xsk_destruct_skb(), which submits the descriptor address(es) to
the completion queue and advances the CQ producer. If such an skb is
subsequently freed via kfree_skb() along an error path - before the
skb has ever been handed to the driver - the destructor still runs and
submits a bogus, half-initialized address to the CQ.
Postpone this init until the first frag has been set up successfully.
Before the init, the skb can be safely freed with kfree_skb().
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ae59d1c1d2f8..13bacc11fa9d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -739,8 +739,6 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
return ERR_PTR(err);
skb_reserve(skb, hr);
-
- xsk_skb_init_misc(skb, xs, desc->addr);
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
if (unlikely(err))
@@ -837,7 +835,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
if (unlikely(err))
goto free_err;
- xsk_skb_init_misc(skb, xs, desc->addr);
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc,
xs->pool, hr);
@@ -887,6 +884,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
}
}
+ if (!xs->skb)
+ xsk_skb_init_misc(skb, xs, desc->addr);
xsk_inc_num_desc(skb);
return skb;
--
2.41.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (4 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
` (2 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
The skb freshly allocated in xsk_build_skb_zerocopy() is leaked when
xsk_skb_metadata() fails. Fix it by explicitly calling kfree_skb()
before returning to the caller.
How to reproduce it in virtio_net:
1. the current skb is the first one (which means it has no frags and
xs->skb is NULL) and the metadata feature is enabled.
2. xsk_skb_metadata() returns an error code.
3. the caller xsk_build_skb() clears the skb pointer with 'skb = NULL;'.
4. there is no chance to free this skb anymore.
Closes: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 13bacc11fa9d..5e6326e076ab 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -741,8 +741,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
skb_reserve(skb, hr);
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
- if (unlikely(err))
+ if (unlikely(err)) {
+ kfree_skb(skb);
return ERR_PTR(err);
+ }
}
} else {
struct xsk_addrs *xsk_addr;
--
2.41.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net v5 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (5 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-02 20:07 ` [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
2026-05-05 15:44 ` [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Alexander Lobakin
8 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
When xsk_build_skb() / xsk_build_skb_zerocopy() sees the first
continuation descriptor, it promotes destructor_arg from an inlined
address to a freshly allocated xsk_addrs (num_descs = 1). The counter
is bumped to >= 2 only at the very end of a successful build (by calling
xsk_inc_num_desc()).
If the build fails in between (e.g. alloc_page() returns NULL with
-EAGAIN, or the MAX_SKB_FRAGS overflow hits), we jump to free_err,
skip the xsk_inc_num_desc() call, and leave the half-built skb
attached to xs->skb for the app to retry. The skb now has:
1) destructor_arg = a real xsk_addrs pointer,
2) num_descs = 1.
If the app never retries and just close()s the socket, xsk_release()
calls xsk_drop_skb() -> xsk_consume_skb(), which decides whether to
free xsk_addrs by testing num_descs > 1:
if (unlikely(num_descs > 1))
kmem_cache_free(xsk_tx_generic_cache, destructor_arg);
Because num_descs is exactly 1, the branch is skipped and the
xsk_addrs object allocated from xsk_tx_generic_cache is leaked.
Fix it by directly testing whether destructor_arg still holds an
inline address. If it does not, it has been repointed to memory newly
allocated from xsk_tx_generic_cache, which must be freed regardless of
whether num_descs was ever incremented.
Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5e6326e076ab..ed96f6ec8ff2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -605,7 +605,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
spin_lock_irqsave(&pool->cq_prod_lock, flags);
idx = xskq_get_prod(pool->cq);
- if (unlikely(num_descs > 1)) {
+ if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
for (i = 0; i < num_descs; i++) {
@@ -660,7 +660,7 @@ static void xsk_consume_skb(struct sk_buff *skb)
u32 num_descs = xsk_get_num_desc(skb);
struct xsk_addrs *xsk_addr;
- if (unlikely(num_descs > 1)) {
+ if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
}
--
2.41.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (6 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
@ 2026-05-02 20:07 ` Jason Xing
2026-05-04 14:59 ` Stanislav Fomichev
2026-05-05 15:44 ` [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Alexander Lobakin
8 siblings, 1 reply; 12+ messages in thread
From: Jason Xing @ 2026-05-02 20:07 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
uintptr_t cast:
skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
mode the chunk offset is encoded in bits 48-63 of the descriptor
address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
lost entirely. The completion queue then returns a truncated address to
userspace, making buffer recycling impossible.
Fix this by handling the 32-bit case directly in
xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an
xsk_addrs struct (the same path already used for multi-descriptor
SKBs) to store the full u64 address. The existing tagged-pointer logic
in xsk_skb_destructor_is_addr() stays unchanged: slab pointers returned
from kmem_cache_zalloc() are always word-aligned and therefore have
bit 0 clear, which correctly identifies them as a struct pointer
rather than an inline tagged address on every architecture.
Factor the shared kmem_cache_zalloc + destructor_arg assignment into
__xsk_addrs_alloc() and add a wrapper xsk_addrs_alloc() that handles
the inline-to-list upgrade (is_addr check + get_addr + num_descs = 1).
The three former open-coded kmem_cache_zalloc call sites now reduce to
a single call each.
Propagate the -ENOMEM from xsk_skb_destructor_set_addr() through
xsk_skb_init_misc() so the caller can clean up the skb via kfree_skb()
before skb->destructor is installed.
The overhead is one extra kmem_cache_zalloc per first descriptor on
32-bit only; 64-bit builds are completely unchanged.
Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 88 ++++++++++++++++++++++++++++++++-------------------
1 file changed, 56 insertions(+), 32 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ed96f6ec8ff2..6bcd77068e52 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -566,9 +566,42 @@ static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
return (u64)((uintptr_t)skb_shinfo(skb)->destructor_arg & ~0x1UL);
}
-static void xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
+static struct xsk_addrs *__xsk_addrs_alloc(struct sk_buff *skb, u64 addr)
{
- skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
+ struct xsk_addrs *xsk_addr;
+
+ xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
+ if (unlikely(!xsk_addr))
+ return NULL;
+
+ xsk_addr->addrs[0] = addr;
+ skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
+ return xsk_addr;
+}
+
+static struct xsk_addrs *xsk_addrs_alloc(struct sk_buff *skb)
+{
+ struct xsk_addrs *xsk_addr;
+
+ if (!xsk_skb_destructor_is_addr(skb))
+ return (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+
+ xsk_addr = __xsk_addrs_alloc(skb, xsk_skb_destructor_get_addr(skb));
+ if (likely(xsk_addr))
+ xsk_addr->num_descs = 1;
+ return xsk_addr;
+}
+
+static int xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
+{
+ if (IS_ENABLED(CONFIG_64BIT)) {
+ skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
+ return 0;
+ }
+
+ if (unlikely(!__xsk_addrs_alloc(skb, addr)))
+ return -ENOMEM;
+ return 0;
}
static void xsk_inc_num_desc(struct sk_buff *skb)
@@ -644,14 +677,20 @@ void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
}
-static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
- u64 addr)
+static int xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
+ u64 addr)
{
+ int err;
+
+ err = xsk_skb_destructor_set_addr(skb, addr);
+ if (unlikely(err))
+ return err;
+
skb->dev = xs->dev;
skb->priority = READ_ONCE(xs->sk.sk_priority);
skb->mark = READ_ONCE(xs->sk.sk_mark);
skb->destructor = xsk_destruct_skb;
- xsk_skb_destructor_set_addr(skb, addr);
+ return 0;
}
static void xsk_consume_skb(struct sk_buff *skb)
@@ -749,18 +788,9 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
} else {
struct xsk_addrs *xsk_addr;
- if (xsk_skb_destructor_is_addr(skb)) {
- xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
- GFP_KERNEL);
- if (!xsk_addr)
- return ERR_PTR(-ENOMEM);
-
- xsk_addr->num_descs = 1;
- xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
- skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
- } else {
- xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
- }
+ xsk_addr = xsk_addrs_alloc(skb);
+ if (!xsk_addr)
+ return ERR_PTR(-ENOMEM);
/* in case of -EOVERFLOW that could happen below,
* xsk_consume_skb() will release this node as whole skb
@@ -849,19 +879,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
struct page *page;
u8 *vaddr;
- if (xsk_skb_destructor_is_addr(skb)) {
- xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
- GFP_KERNEL);
- if (!xsk_addr) {
- err = -ENOMEM;
- goto free_err;
- }
-
- xsk_addr->num_descs = 1;
- xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
- skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
- } else {
- xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+ xsk_addr = xsk_addrs_alloc(skb);
+ if (!xsk_addr) {
+ err = -ENOMEM;
+ goto free_err;
}
if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
@@ -886,8 +907,11 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
}
}
- if (!xs->skb)
- xsk_skb_init_misc(skb, xs, desc->addr);
+ if (!xs->skb) {
+ err = xsk_skb_init_misc(skb, xs, desc->addr);
+ if (unlikely(err))
+ goto free_err;
+ }
xsk_inc_num_desc(skb);
return skb;
--
2.41.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-05-02 20:07 ` [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
@ 2026-05-04 14:59 ` Stanislav Fomichev
0 siblings, 0 replies; 12+ messages in thread
From: Stanislav Fomichev @ 2026-05-04 14:59 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
On 05/02, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
> descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
> uintptr_t cast:
>
> skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
>
> On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
> the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
> mode the chunk offset is encoded in bits 48-63 of the descriptor
> address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
> lost entirely. The completion queue then returns a truncated address to
> userspace, making buffer recycling impossible.
>
> Fix this by handling the 32-bit case directly in
> xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an
> xsk_addrs struct (the same path already used for multi-descriptor
> SKBs) to store the full u64 address. The existing tagged-pointer logic
> in xsk_skb_destructor_is_addr() stays unchanged: slab pointers returned
> from kmem_cache_zalloc() are always word-aligned and therefore have
> bit 0 clear, which correctly identifies them as a struct pointer
> rather than an inline tagged address on every architecture.
>
> Factor the shared kmem_cache_zalloc + destructor_arg assignment into
> __xsk_addrs_alloc() and add a wrapper xsk_addrs_alloc() that handles
> the inline-to-list upgrade (is_addr check + get_addr + num_descs = 1).
> The three former open-coded kmem_cache_zalloc call sites now reduce to
> a single call each.
>
> Propagate the -ENOMEM from xsk_skb_destructor_set_addr() through
> xsk_skb_init_misc() so the caller can clean up the skb via kfree_skb()
> before skb->destructor is installed.
>
> The overhead is one extra kmem_cache_zalloc per first descriptor on
> 32-bit only; 64-bit builds are completely unchanged.
>
> Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
> Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
LGTM, thanks!
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (7 preceding siblings ...)
2026-05-02 20:07 ` [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
@ 2026-05-05 15:44 ` Alexander Lobakin
2026-05-05 19:09 ` Jason Xing
8 siblings, 1 reply; 12+ messages in thread
From: Alexander Lobakin @ 2026-05-05 15:44 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Sat, 2 May 2026 23:07:14 +0300
> From: Jason Xing <kernelxing@tencent.com>
>
> There are rare issues around xsk_build_skb(). Some of them
> were found by Sashiko[1][2].
For the series:
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Just don't forget to look at Sashiko's replies, I haven't looked deep
into them, so can't say whether anything there makes sense.
>
> [1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
> [2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/
>
> ---
> v5
> Link: https://lore.kernel.org/all/20260424053816.27965-1-kerneljasonxing@gmail.com/
> 1. reword patch 3
> 2. adjust the order: former patch 6 is now moved up to be patch 2
> 3. use helper in patch (Stan)
>
> v4
> Link: https://lore.kernel.org/all/20260422033650.68457-1-kerneljasonxing@gmail.com/
> 1. fix 32-bit arch issue in patch 8 (Alexander && Maciej)
> 2. add acked-by tags (Stan)
>
> v3
> Link: https://lore.kernel.org/all/20260420082805.14844-1-kerneljasonxing@gmail.com/
> 1. use !xs->skb as the indicator of first frag in patch 3 (Stan)
> 2. move init skb after it's allocated successfully, so patch 4,5,6 have
> been adjusted accordingly (Stan)
> 3. do not support 32-bit arch (Stan)
> 4. add acked-by tags (Stan)
>
> v2
> Link: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/#t
> 1. add four patches spotted by Sashiko to fix buggy pre-existing
> behavior
> 2. adjust the order of 8 patches.
>
>
>
> Jason Xing (8):
> xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
> xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
> xsk: handle NULL dereference of the skb without frags issue
> xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
> xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
> xsk: avoid skb leak in XDP_TX_METADATA case
> xsk: fix xsk_addrs slab leak on multi-buffer error path
> xsk: fix u64 descriptor address truncation on 32-bit architectures
>
> net/xdp/xsk.c | 115 ++++++++++++++++++++++++++--------------
> net/xdp/xsk_buff_pool.c | 3 ++
> 2 files changed, 77 insertions(+), 41 deletions(-)
Thanks,
Olek
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation
2026-05-05 15:44 ` [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Alexander Lobakin
@ 2026-05-05 19:09 ` Jason Xing
0 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2026-05-05 19:09 UTC (permalink / raw)
To: Alexander Lobakin
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
On Tue, May 5, 2026 at 6:45 PM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Sat, 2 May 2026 23:07:14 +0300
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > There are rare issues around xsk_build_skb(). Some of them
> > were found by Sashiko[1][2].
>
> For the series:
>
> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>
> Just don't forget to look at Sashiko's replies, I haven't looked deep
> into them, so can't say whether anything there makes sense.
Thanks for the review. I'm currently away from the keyboard because of
a long vacation. I'll handle them one by one over the next two days;
hopefully, I can manage it today :P
Thanks,
Jason
>
> >
> > [1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
> > [2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/
> >
> > ---
> > v5
> > Link: https://lore.kernel.org/all/20260424053816.27965-1-kerneljasonxing@gmail.com/
> > 1. reword patch 3
> > 2. adjust the order of patch 6, which is now moved up to be patch 2
> > 3. use helper in patch (Stan)
> >
> > v4
> > Link: https://lore.kernel.org/all/20260422033650.68457-1-kerneljasonxing@gmail.com/
> > 1. fix 32-bit arch issue in patch 8 (Alexander && Maciej)
> > 2. add acked-by tags (Stan)
> >
> > v3
> > Link: https://lore.kernel.org/all/20260420082805.14844-1-kerneljasonxing@gmail.com/
> > 1. use !xs->skb as the indicator of first frag in patch 3 (Stan)
> > 2. move init skb after it's allocated successfully, so patch 4,5,6 have
> > been adjusted accordingly (Stan)
> > 3. do not support 32-bit arch (Stan)
> > 4. add acked-by tags (Stan)
> >
> > v2
> > Link: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/#t
> > 1. add four patches spotted by Sashiko to fix buggy pre-existing
> > behavior
> > 2. adjust the order of 8 patches.
> >
> >
> >
> > Jason Xing (8):
> > xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
> > xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
> > xsk: handle NULL dereference of the skb without frags issue
> > xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
> > xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
> > xsk: avoid skb leak in XDP_TX_METADATA case
> > xsk: fix xsk_addrs slab leak on multi-buffer error path
> > xsk: fix u64 descriptor address truncation on 32-bit architectures
> >
> > net/xdp/xsk.c | 115 ++++++++++++++++++++++++++--------------
> > net/xdp/xsk_buff_pool.c | 3 ++
> > 2 files changed, 77 insertions(+), 41 deletions(-)
>
> Thanks,
> Olek
^ permalink raw reply [flat|nested] 12+ messages in thread
Thread overview: 12+ messages
2026-05-02 20:07 [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-05-02 20:07 ` [PATCH net v5 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
2026-05-02 20:07 ` [PATCH net v5 2/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
2026-05-02 20:07 ` [PATCH net v5 3/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
2026-05-02 20:07 ` [PATCH net v5 4/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
2026-05-02 20:07 ` [PATCH net v5 5/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
2026-05-02 20:07 ` [PATCH net v5 6/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
2026-05-02 20:07 ` [PATCH net v5 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
2026-05-02 20:07 ` [PATCH net v5 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
2026-05-04 14:59 ` Stanislav Fomichev
2026-05-05 15:44 ` [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation Alexander Lobakin
2026-05-05 19:09 ` Jason Xing