* [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
2026-04-20 8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
@ 2026-04-20 8:27 ` Jason Xing
2026-04-20 19:34 ` Stanislav Fomichev
2026-04-20 8:27 ` [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
` (6 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Jason Xing @ 2026-04-20 8:27 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
skb_checksum_help() is a common helper that writes the folded
16-bit checksum back via skb->data + csum_start + csum_offset,
i.e. it relies on the skb's linear head and fails (with WARN_ONCE
and -EINVAL) when skb_headlen() is 0.
AF_XDP generic xmit takes two very different paths depending on the
netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
skip the "copy payload into a linear head" step on purpose as a
performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
pages as frags and never calls skb_put(), so skb_headlen() stays 0
for the whole skb. For these skbs there is simply no linear area for
skb_checksum_help() to write the csum into - the sw-csum fallback is
structurally inapplicable.
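The failure mode can be sketched with a small user-space model (names and the exact bounds check are illustrative, not the kernel implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model (not kernel code) of the precondition that
 * skb_checksum_help() enforces: the folded 16-bit checksum is written at
 * skb->data + csum_start + csum_offset, which must lie inside the linear
 * head.  With skb_headlen() == 0, as for skbs built by
 * xsk_build_skb_zerocopy() on IFF_TX_SKB_NO_LINEAR devices, every call
 * fails.  The -22 stands for -EINVAL. */
static int checksum_help_model(size_t headlen, size_t csum_start,
			       size_t csum_offset)
{
	size_t offset = csum_start + csum_offset;

	if (offset + sizeof(unsigned short) > headlen)
		return -22; /* WARN_ONCE + -EINVAL in the real helper */
	return 0;
}
```

With a zero-length linear area there is no offset for which the write can succeed, which is why the fallback is rejected wholesale at bind time.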
Catch this and reject the combination with an error at setup time.
Rejecting at bind() converts a per-packet runtime failure into a
synchronous, actionable -EOPNOTSUPP. HW csum and
launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
because they do not call skb_checksum_help().
Without the patch, every descriptor carrying 'XDP_TX_METADATA |
XDP_TXMD_FLAGS_CHECKSUM' produces:
1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
2) sendmsg() returning -EINVAL without consuming the descriptor
(invalid_descs is not incremented),
3) a wedged TX ring: __xsk_generic_xmit() does not advance the
consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
the same descriptor and re-hits the same WARN until the socket
is closed.
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk_buff_pool.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 37b7a68b89b3..c2521b6547e3 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
if (force_zc && force_copy)
return -EINVAL;
+ if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
+ return -EOPNOTSUPP;
+
if (xsk_get_pool_from_qid(netdev, queue_id))
return -EBUSY;
--
2.41.3
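The "wedged TX ring" in point 3 above can be illustrated with a toy consumer loop (a sketch only; it models, but does not reproduce, __xsk_generic_xmit()'s queue handling):

```c
#include <assert.h>

#define EINVAL 22
#define EOVERFLOW 75

/* Toy model of __xsk_generic_xmit()'s consumer handling: the consumer
 * index only advances on success or -EOVERFLOW, so a descriptor that
 * keeps failing with -EINVAL is re-read by every subsequent sendmsg(). */
static int consumer;

static void xmit_one(int err)
{
	if (err == 0 || err == -EOVERFLOW)
		consumer++; /* descriptor consumed */
	/* any other error: consumer untouched, same descriptor next time */
}
```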
^ permalink raw reply related [flat|nested] 21+ messages in thread

* Re: [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
2026-04-20 8:27 ` [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
@ 2026-04-20 19:34 ` Stanislav Fomichev
2026-04-20 23:51 ` Jason Xing
0 siblings, 1 reply; 21+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:34 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf, netdev, Jason Xing
> From: Jason Xing <kernelxing@tencent.com>
>
> skb_checksum_help() is a common helper that writes the folded
> 16-bit checksum back via skb->data + csum_start + csum_offset,
> i.e. it relies on the skb's linear head and fails (with WARN_ONCE
> and -EINVAL) when skb_headlen() is 0.
>
> AF_XDP generic xmit takes two very different paths depending on the
> netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
> skip the "copy payload into a linear head" step on purpose as a
> performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
> pages as frags and never calls skb_put(), so skb_headlen() stays 0
> for the whole skb. For these skbs there is simply no linear area for
> skb_checksum_help() to write the csum into - the sw-csum fallback is
> structurally inapplicable.
>
> The patch tries to catch this and reject the combination with error at
> setup time. Rejecting at bind() converts this silent per-packet failure
> into a synchronous, actionable -EOPNOTSUPP at setup time. HW csum and
> launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
> because they do not call skb_checksum_help().
>
> Without the patch, every descriptor carrying 'XDP_TX_METADATA |
> XDP_TXMD_FLAGS_CHECKSUM' produces:
> 1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
> 2) sendmsg() returning -EINVAL without consuming the descriptor
> (invalid_descs is not incremented),
> 3) a wedged TX ring: __xsk_generic_xmit() does not advance the
> consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
> the same descriptor and re-hits the same WARN until the socket
> is closed.
>
> Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
> Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> net/xdp/xsk_buff_pool.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 37b7a68b89b3..c2521b6547e3 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> if (force_zc && force_copy)
> return -EINVAL;
>
> + if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> + return -EOPNOTSUPP;
> +
> if (xsk_get_pool_from_qid(netdev, queue_id))
> return -EBUSY;
>
> --
> 2.41.3
>
Wondering whether a better fixes tag is commit 11614723af26 ("xsk: Add option
to calculate TX checksum in SW")?
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
* Re: [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
2026-04-20 19:34 ` Stanislav Fomichev
@ 2026-04-20 23:51 ` Jason Xing
2026-04-21 22:20 ` Stanislav Fomichev
0 siblings, 1 reply; 21+ messages in thread
From: Jason Xing @ 2026-04-20 23:51 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: bpf, netdev, Jason Xing
On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > skb_checksum_help() is a common helper that writes the folded
> > 16-bit checksum back via skb->data + csum_start + csum_offset,
> > i.e. it relies on the skb's linear head and fails (with WARN_ONCE
> > and -EINVAL) when skb_headlen() is 0.
> >
> > AF_XDP generic xmit takes two very different paths depending on the
> > netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
> > skip the "copy payload into a linear head" step on purpose as a
> > performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
> > pages as frags and never calls skb_put(), so skb_headlen() stays 0
> > for the whole skb. For these skbs there is simply no linear area for
> > skb_checksum_help() to write the csum into - the sw-csum fallback is
> > structurally inapplicable.
> >
> > The patch tries to catch this and reject the combination with error at
> > setup time. Rejecting at bind() converts this silent per-packet failure
> > into a synchronous, actionable -EOPNOTSUPP at setup time. HW csum and
> > launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
> > because they do not call skb_checksum_help().
> >
> > Without the patch, every descriptor carrying 'XDP_TX_METADATA |
> > XDP_TXMD_FLAGS_CHECKSUM' produces:
> > 1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
> > 2) sendmsg() returning -EINVAL without consuming the descriptor
> > (invalid_descs is not incremented),
> > 3) a wedged TX ring: __xsk_generic_xmit() does not advance the
> > consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
> > the same descriptor and re-hits the same WARN until the socket
> > is closed.
> >
> > Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
> > Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > net/xdp/xsk_buff_pool.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index 37b7a68b89b3..c2521b6547e3 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > if (force_zc && force_copy)
> > return -EINVAL;
> >
> > + if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> > + return -EOPNOTSUPP;
> > +
> > if (xsk_get_pool_from_qid(netdev, queue_id))
> > return -EBUSY;
> >
> > --
> > 2.41.3
> >
>
> Wondering whether a better fixes tag is commit 11614723af26 ("xsk: Add option
> to calculate TX checksum in SW")?
>
> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Thanks for checking, but not really. It is commit 30c3055f9c0d that
brought csum support to the IFF_TX_SKB_NO_LINEAR case where this issue
can be triggered (because that mode no longer puts data into the skb
linear area).
Thanks,
Jason
* Re: [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
2026-04-20 23:51 ` Jason Xing
@ 2026-04-21 22:20 ` Stanislav Fomichev
0 siblings, 0 replies; 21+ messages in thread
From: Stanislav Fomichev @ 2026-04-21 22:20 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf, netdev, Jason Xing
On 04/21, Jason Xing wrote:
> On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
> >
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > skb_checksum_help() is a common helper that writes the folded
> > > 16-bit checksum back via skb->data + csum_start + csum_offset,
> > > i.e. it relies on the skb's linear head and fails (with WARN_ONCE
> > > and -EINVAL) when skb_headlen() is 0.
> > >
> > > AF_XDP generic xmit takes two very different paths depending on the
> > > netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
> > > skip the "copy payload into a linear head" step on purpose as a
> > > performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
> > > pages as frags and never calls skb_put(), so skb_headlen() stays 0
> > > for the whole skb. For these skbs there is simply no linear area for
> > > skb_checksum_help() to write the csum into - the sw-csum fallback is
> > > structurally inapplicable.
> > >
> > > The patch tries to catch this and reject the combination with error at
> > > setup time. Rejecting at bind() converts this silent per-packet failure
> > > into a synchronous, actionable -EOPNOTSUPP at setup time. HW csum and
> > > launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
> > > because they do not call skb_checksum_help().
> > >
> > > Without the patch, every descriptor carrying 'XDP_TX_METADATA |
> > > XDP_TXMD_FLAGS_CHECKSUM' produces:
> > > 1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
> > > 2) sendmsg() returning -EINVAL without consuming the descriptor
> > > (invalid_descs is not incremented),
> > > 3) a wedged TX ring: __xsk_generic_xmit() does not advance the
> > > consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
> > > the same descriptor and re-hits the same WARN until the socket
> > > is closed.
> > >
> > > Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
> > > Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > net/xdp/xsk_buff_pool.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > index 37b7a68b89b3..c2521b6547e3 100644
> > > --- a/net/xdp/xsk_buff_pool.c
> > > +++ b/net/xdp/xsk_buff_pool.c
> > > @@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > > if (force_zc && force_copy)
> > > return -EINVAL;
> > >
> > > + if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> > > + return -EOPNOTSUPP;
> > > +
> > > if (xsk_get_pool_from_qid(netdev, queue_id))
> > > return -EBUSY;
> > >
> > > --
> > > 2.41.3
> > >
> >
> > Wondering whether a better fixes tag is commit 11614723af26 ("xsk: Add option
> > to calculate TX checksum in SW")?
> >
> > Acked-by: Stanislav Fomichev <sdf@fomichev.me>
>
> Thanks for the check. But not really. It is the commit 30c3055f9c0d
> that brings the csum support of IFF_TX_SKB_NO_LINEAR case where this
> issue can be triggered (because this mode no longer puts data into skb
> linear area).
Ah, right, good point, it's the IFF_TX_SKB_NO_LINEAR path.
* [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue
2026-04-20 8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-04-20 8:27 ` [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
@ 2026-04-20 8:27 ` Jason Xing
2026-04-20 19:34 ` Stanislav Fomichev
2026-04-20 8:28 ` [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
` (5 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Jason Xing @ 2026-04-20 8:27 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
When a first descriptor (xs->skb == NULL) triggers -EOVERFLOW in
xsk_build_skb_zerocopy (e.g., MAX_SKB_FRAGS exceeded), the free_err
EOVERFLOW handler unconditionally dereferences xs->skb via
xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb), causing a NULL
pointer dereference.
In this series, the skb is already freed by kfree_skb() inside
xsk_build_skb_zerocopy for the first-descriptor case, so we only need
to do the bookkeeping: cancel the one reserved CQ slot and account for
the single invalid descriptor.
Guard the existing xsk_inc_num_desc/xsk_drop_skb calls with an
xs->skb check (for the continuation case), and add an else branch
for the first-descriptor case that manually cancels the CQ slot and
increments invalid_descs by one.
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6149f6a79897..6521604f8d42 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -893,9 +893,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
kfree_skb(skb);
if (err == -EOVERFLOW) {
- /* Drop the packet */
- xsk_inc_num_desc(xs->skb);
- xsk_drop_skb(xs->skb);
+ if (xs->skb) {
+ /* Drop the packet */
+ xsk_inc_num_desc(xs->skb);
+ xsk_drop_skb(xs->skb);
+ } else {
+ xsk_cq_cancel_locked(xs->pool, 1);
+ xs->tx->invalid_descs++;
+ }
xskq_cons_release(xs->tx);
} else {
/* Let application retry */
--
2.41.3
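The two bookkeeping branches of the fix can be modelled in user space as follows (struct fields and helper names are illustrative stand-ins for xs->skb, xsk_cq_cancel_locked() and xskq_cons_release(), not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the fixed -EOVERFLOW bookkeeping. have_skb stands for
 * xs->skb != NULL, cq_reserved for outstanding completion-queue
 * reservations, consumer for the TX ring consumer index. */
struct tx_state {
	bool have_skb;
	int cq_reserved;
	int invalid_descs;
	int consumer;
};

static void handle_eoverflow(struct tx_state *s)
{
	if (s->have_skb) {
		/* models xsk_drop_skb(): frees the in-progress skb and
		 * releases all of its CQ reservations (its per-descriptor
		 * accounting is omitted here for brevity) */
		s->cq_reserved = 0;
		s->have_skb = false;
	} else {
		/* first descriptor: the skb was already freed, so only
		 * cancel the one reserved CQ slot and count one invalid
		 * descriptor, as xsk_cq_cancel_locked(pool, 1) and
		 * invalid_descs++ do in the patch */
		s->cq_reserved -= 1;
		s->invalid_descs += 1;
	}
	s->consumer += 1; /* models xskq_cons_release() */
}
```

Either way the descriptor is consumed, so the ring cannot wedge on it; only the cleanup differs depending on whether an in-progress skb exists.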
* Re: [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue
2026-04-20 8:27 ` [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
@ 2026-04-20 19:34 ` Stanislav Fomichev
0 siblings, 0 replies; 21+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:34 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf, netdev, Jason Xing
> From: Jason Xing <kernelxing@tencent.com>
>
> When a first descriptor (xs->skb == NULL) triggers -EOVERFLOW in
> xsk_build_skb_zerocopy (e.g., MAX_SKB_FRAGS exceeded), the free_err
> EOVERFLOW handler unconditionally dereferences xs->skb via
> xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb), causing a NULL
> pointer dereference.
>
> In this series, the skb is already freed by kfree_skb() inside
> xsk_build_skb_zerocopy for the first-descriptor case, so we only need
> to do the bookkeeping: cancel the one reserved CQ slot and account for
> the single invalid descriptor.
>
> Guard the existing xsk_inc_num_desc/xsk_drop_skb calls with an
> xs->skb check (for the continuation case), and add an else branch
> for the first-descriptor case that manually cancels the CQ slot and
> increments invalid_descs by one.
>
> Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> net/xdp/xsk.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 6149f6a79897..6521604f8d42 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -893,9 +893,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> kfree_skb(skb);
>
> if (err == -EOVERFLOW) {
> - /* Drop the packet */
> - xsk_inc_num_desc(xs->skb);
> - xsk_drop_skb(xs->skb);
> + if (xs->skb) {
> + /* Drop the packet */
> + xsk_inc_num_desc(xs->skb);
> + xsk_drop_skb(xs->skb);
> + } else {
> + xsk_cq_cancel_locked(xs->pool, 1);
> + xs->tx->invalid_descs++;
> + }
> xskq_cons_release(xs->tx);
> } else {
> /* Let application retry */
> --
> 2.41.3
>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
* [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
2026-04-20 8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
2026-04-20 8:27 ` [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices Jason Xing
2026-04-20 8:27 ` [PATCH net v2 2/8] xsk: handle NULL dereference of the skb without frags issue Jason Xing
@ 2026-04-20 8:28 ` Jason Xing
2026-04-20 19:34 ` Stanislav Fomichev
2026-04-20 8:28 ` [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
` (4 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Jason Xing @ 2026-04-20 8:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
When xsk_build_skb() processes multi-buffer packets in copy mode, the
first descriptor stores data into the skb linear area without adding
any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
to accumulate subsequent descriptors.
If a continuation descriptor fails (e.g. alloc_page returns NULL with
-EAGAIN), we jump to free_err where the condition:
if (skb && !skb_shinfo(skb)->nr_frags)
kfree_skb(skb);
evaluates to true because nr_frags is still 0 (the first descriptor
used the linear area, not frags). This frees the skb while xs->skb
still points to it, creating a dangling pointer. On the next transmit
attempt or socket close, xs->skb is dereferenced, causing a
use-after-free or double-free.
Fix by adding a !xs->skb check to the condition, ensuring we only free
skbs that were freshly allocated in this call (xs->skb is NULL) and
never free an in-progress multi-buffer skb that the caller still
references.
Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6521604f8d42..4fdd1a45a9bd 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -889,7 +889,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
return skb;
free_err:
- if (skb && !skb_shinfo(skb)->nr_frags)
+ if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
kfree_skb(skb);
if (err == -EOVERFLOW) {
--
2.41.3
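The ownership rule behind the one-line fix can be captured as a pair of predicates (a model for illustration only; the parameters stand for skb != NULL, xs->skb != NULL and skb_shinfo(skb)->nr_frags):

```c
#include <assert.h>
#include <stdbool.h>

/* Old condition: frees any frag-less skb, including the in-progress
 * multi-buffer skb still referenced by xs->skb (use-after-free). */
static bool should_kfree_old(bool have_skb, bool xs_skb, int nr_frags)
{
	(void)xs_skb; /* old code never consulted xs->skb here */
	return have_skb && nr_frags == 0;
}

/* Fixed condition: only free an skb freshly allocated in this call,
 * i.e. when the caller holds no reference via xs->skb. */
static bool should_kfree_new(bool have_skb, bool xs_skb, int nr_frags)
{
	return have_skb && !xs_skb && nr_frags == 0;
}
```

The interesting case is the first descriptor of a multi-buffer packet: nr_frags is still 0 but xs->skb already points at the skb, so the old predicate freed it out from under the caller.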
* Re: [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
2026-04-20 8:28 ` [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
@ 2026-04-20 19:34 ` Stanislav Fomichev
2026-04-21 0:01 ` Jason Xing
0 siblings, 1 reply; 21+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:34 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf, netdev, Jason Xing
> From: Jason Xing <kernelxing@tencent.com>
>
> When xsk_build_skb() processes multi-buffer packets in copy mode, the
> first descriptor stores data into the skb linear area without adding
> any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
> to accumulate subsequent descriptors.
>
> If a continuation descriptor fails (e.g. alloc_page returns NULL with
> -EAGAIN), we jump to free_err where the condition:
>
> if (skb && !skb_shinfo(skb)->nr_frags)
> kfree_skb(skb);
>
> evaluates to true because nr_frags is still 0 (the first descriptor
> used the linear area, not frags). This frees the skb while xs->skb
> still points to it, creating a dangling pointer. On the next transmit
> attempt or socket close, xs->skb is dereferenced, causing a
> use-after-free or double-free.
>
> Fix by adding a !xs->skb check to the condition, ensuring we only free
> skbs that were freshly allocated in this call (xs->skb is NULL) and
> never free an in-progress multi-buffer skb that the caller still
> references.
>
> Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
> Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> net/xdp/xsk.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 6521604f8d42..4fdd1a45a9bd 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -889,7 +889,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> return skb;
>
> free_err:
> - if (skb && !skb_shinfo(skb)->nr_frags)
> + if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
> kfree_skb(skb);
>
> if (err == -EOVERFLOW) {
> --
> 2.41.3
Now "!skb_shinfo(skb)->nr_frags" feels redundant? It's either
"skb && !xs->skb" and we own the kfree. or "xs->skb != NULL" and we
want xsk_drop_skb? Or am I missing something?
* Re: [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
2026-04-20 19:34 ` Stanislav Fomichev
@ 2026-04-21 0:01 ` Jason Xing
0 siblings, 0 replies; 21+ messages in thread
From: Jason Xing @ 2026-04-21 0:01 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: bpf, netdev, Jason Xing
On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > When xsk_build_skb() processes multi-buffer packets in copy mode, the
> > first descriptor stores data into the skb linear area without adding
> > any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
> > to accumulate subsequent descriptors.
> >
> > If a continuation descriptor fails (e.g. alloc_page returns NULL with
> > -EAGAIN), we jump to free_err where the condition:
> >
> > if (skb && !skb_shinfo(skb)->nr_frags)
> > kfree_skb(skb);
> >
> > evaluates to true because nr_frags is still 0 (the first descriptor
> > used the linear area, not frags). This frees the skb while xs->skb
> > still points to it, creating a dangling pointer. On the next transmit
> > attempt or socket close, xs->skb is dereferenced, causing a
> > use-after-free or double-free.
> >
> > Fix by adding a !xs->skb check to the condition, ensuring we only free
> > skbs that were freshly allocated in this call (xs->skb is NULL) and
> > never free an in-progress multi-buffer skb that the caller still
> > references.
> >
> > Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
> > Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > net/xdp/xsk.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 6521604f8d42..4fdd1a45a9bd 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -889,7 +889,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > return skb;
> >
> > free_err:
> > - if (skb && !skb_shinfo(skb)->nr_frags)
> > + if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
> > kfree_skb(skb);
> >
> > if (err == -EOVERFLOW) {
> > --
> > 2.41.3
>
> Now "!skb_shinfo(skb)->nr_frags" feels redundant? It's either
> "skb && !xs->skb" and we own the kfree. or "xs->skb != NULL" and we
> want xsk_drop_skb? Or am I missing something?
You're right that it is redundant. I'm removing it now :)
At this stage, the job of this if statement is to identify the first
skb, so !xs->skb is a clear indicator, as you said.
Thanks,
Jason
* [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
2026-04-20 8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (2 preceding siblings ...)
2026-04-20 8:28 ` [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path Jason Xing
@ 2026-04-20 8:28 ` Jason Xing
2026-04-20 19:34 ` Stanislav Fomichev
2026-04-20 8:28 ` [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
` (3 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Jason Xing @ 2026-04-20 8:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
Once xsk_skb_init_misc() has been called on an skb, its destructor is
set to xsk_destruct_skb(), which submits the descriptor address(es) to
the completion queue and advances the CQ producer. If such an skb is
subsequently freed via kfree_skb() along an error path - before the
skb has ever been handed to the driver - the destructor still runs and
submits a bogus, half-initialized address to the CQ.
Introduce a new common helper to fix the issue; it will also be used
by subsequent patches in this series.
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4fdd1a45a9bd..614e7bd1252b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -717,6 +717,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
return 0;
}
+static void xsk_drop_untrans_skb(struct sk_buff *skb)
+{
+ skb->destructor = sock_wfree;
+ kfree_skb(skb);
+}
+
static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
struct xdp_desc *desc)
{
@@ -890,7 +896,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
free_err:
if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
- kfree_skb(skb);
+ xsk_drop_untrans_skb(skb);
if (err == -EOVERFLOW) {
if (xs->skb) {
--
2.41.3
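A user-space sketch of why resetting the destructor matters (function-pointer plumbing that models, but does not reproduce, skb->destructor and the CQ producer):

```c
#include <assert.h>

struct cq { int producer; };

struct skb_model {
	void (*destructor)(struct cq *);
};

/* Models xsk_destruct_skb(): submits addresses and advances the CQ
 * producer - wrong for an skb the driver never saw. */
static void xsk_destruct_model(struct cq *cq)
{
	cq->producer++;
}

/* Models plain sock_wfree(): releases memory accounting only. */
static void sock_wfree_model(struct cq *cq)
{
	(void)cq;
}

static void kfree_skb_model(struct skb_model *skb, struct cq *cq)
{
	skb->destructor(cq);
}

/* Models the new helper: neutralize the destructor before freeing so a
 * half-built skb cannot publish a bogus CQ entry. */
static void xsk_drop_untrans_skb_model(struct skb_model *skb, struct cq *cq)
{
	skb->destructor = sock_wfree_model;
	kfree_skb_model(skb, cq);
}
```

In the buggy path the destructor runs at kfree time and advances the producer even though the packet never reached the driver; the helper swaps in the harmless destructor first.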
* Re: [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
2026-04-20 8:28 ` [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
@ 2026-04-20 19:34 ` Stanislav Fomichev
2026-04-21 0:51 ` Jason Xing
0 siblings, 1 reply; 21+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:34 UTC (permalink / raw)
To: Jason Xing; +Cc: bpf, netdev, Jason Xing
> From: Jason Xing <kernelxing@tencent.com>
>
> Once xsk_skb_init_misc() has been called on an skb, its destructor is
> set to xsk_destruct_skb(), which submits the descriptor address(es) to
> the completion queue and advances the CQ producer. If such an skb is
> subsequently freed via kfree_skb() along an error path - before the
> skb has ever been handed to the driver - the destructor still runs and
> submits a bogus, half-initialized address to the CQ.
>
> Introduce a new common helper to fix the issue. That function will be
> used by the subsequent patches soon.
>
> Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
> Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> net/xdp/xsk.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 4fdd1a45a9bd..614e7bd1252b 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -717,6 +717,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> return 0;
> }
>
> +static void xsk_drop_untrans_skb(struct sk_buff *skb)
> +{
> + skb->destructor = sock_wfree;
> + kfree_skb(skb);
> +}
> +
> static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> struct xdp_desc *desc)
> {
> @@ -890,7 +896,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>
> free_err:
> if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
> - kfree_skb(skb);
> + xsk_drop_untrans_skb(skb);
>
> if (err == -EOVERFLOW) {
> if (xs->skb) {
> --
> 2.41.3
>
Have you considered the alternative where we postpone `skb->destructor =
xsk_destruct_skb` to a later point? Will this be less messy than
undoing that descriptor in a few curated places?
* Re: [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
2026-04-20 19:34 ` Stanislav Fomichev
@ 2026-04-21 0:51 ` Jason Xing
0 siblings, 0 replies; 21+ messages in thread
From: Jason Xing @ 2026-04-21 0:51 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: bpf, netdev, Jason Xing
On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Once xsk_skb_init_misc() has been called on an skb, its destructor is
> > set to xsk_destruct_skb(), which submits the descriptor address(es) to
> > the completion queue and advances the CQ producer. If such an skb is
> > subsequently freed via kfree_skb() along an error path - before the
> > skb has ever been handed to the driver - the destructor still runs and
> > submits a bogus, half-initialized address to the CQ.
> >
> > Introduce a new common helper to fix the issue. That function will be
> > used by the subsequent patches soon.
> >
> > Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
> > Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > net/xdp/xsk.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 4fdd1a45a9bd..614e7bd1252b 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -717,6 +717,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> > return 0;
> > }
> >
> > +static void xsk_drop_untrans_skb(struct sk_buff *skb)
> > +{
> > + skb->destructor = sock_wfree;
> > + kfree_skb(skb);
> > +}
> > +
> > static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > struct xdp_desc *desc)
> > {
> > @@ -890,7 +896,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >
> > free_err:
> > if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
> > - kfree_skb(skb);
> > + xsk_drop_untrans_skb(skb);
> >
> > if (err == -EOVERFLOW) {
> > if (xs->skb) {
> > --
> > 2.41.3
> >
>
> Have you considered the alternative where we postpone `skb->destructor =
> xsk_destruct_skb` to a later point? Will this be less messy than
> undoing that descriptor in a few curated places?
Thanks. It does make sense. Adding the following code seems fine:
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 887abed25466..02e723ca3080 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -820,7 +820,6 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
skb_reserve(skb, hr);
- xsk_skb_init_misc(skb, xs, desc->addr);
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
if (unlikely(err))
@@ -914,7 +913,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
if (unlikely(err))
goto free_err;
- xsk_skb_init_misc(skb, xs, desc->addr);
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc,
xs->pool, hr);
@@ -964,6 +962,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
}
}
+ if (!xs->skb)
+ xsk_skb_init_misc(skb, xs, desc->addr);
xsk_inc_num_desc(skb);
return skb;
Thanks,
Jason
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case
2026-04-20 8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (3 preceding siblings ...)
2026-04-20 8:28 ` [PATCH net v2 4/8] xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb() Jason Xing
@ 2026-04-20 8:28 ` Jason Xing
2026-04-20 8:28 ` [PATCH net v2 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
` (2 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Jason Xing @ 2026-04-20 8:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
The skb allocated in xsk_build_skb_zerocopy() is leaked when
xsk_skb_metadata() fails. Fix it by explicitly dropping the skb (via
xsk_drop_untrans_skb()) before returning the error to the caller.
How to reproduce it in virtio_net:
1. the current skb is the first one (which means it has no frags and
xs->skb is NULL) and the user has enabled the metadata feature.
2. xsk_skb_metadata() returns an error code.
3. the caller xsk_build_skb() clears skb with 'skb = NULL;'.
4. there is no chance to free this skb anymore.
Closes: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/
Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 614e7bd1252b..51f76e9d6ffd 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -749,8 +749,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
xsk_skb_init_misc(skb, xs, desc->addr);
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
- if (unlikely(err))
+ if (unlikely(err)) {
+ xsk_drop_untrans_skb(skb);
return ERR_PTR(err);
+ }
}
} else {
struct xsk_addrs *xsk_addr;
--
2.41.3
^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [PATCH net v2 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
2026-04-20 8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (4 preceding siblings ...)
2026-04-20 8:28 ` [PATCH net v2 5/8] xsk: avoid skb leak in XDP_TX_METADATA case Jason Xing
@ 2026-04-20 8:28 ` Jason Xing
2026-04-20 8:28 ` [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
2026-04-20 8:28 ` [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
7 siblings, 0 replies; 21+ messages in thread
From: Jason Xing @ 2026-04-20 8:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
The skb built in xsk_build_skb_zerocopy() is leaked when the number of
frags hits the MAX_SKB_FRAGS limit. Fix it by explicitly dropping the
skb (via xsk_drop_untrans_skb()) before returning -EOVERFLOW to the
caller.
How to reproduce it in virtio_net:
1. the current skb is the first one (which means xs->skb is NULL) and
it hits the MAX_SKB_FRAGS limit.
2. xsk_build_skb_zerocopy() returns -EOVERFLOW.
3. the caller xsk_build_skb() clears skb with 'skb = NULL;', which is
why the bug can be triggered.
4. there is no chance to free this skb anymore.
Note that if xs->skb is not NULL in this case, xsk_build_skb() will
call xsk_drop_skb(xs->skb) to do the right thing.
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 51f76e9d6ffd..9236ec32b54a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -784,8 +784,11 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
addr = buffer - pool->addrs;
for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) {
- if (unlikely(i >= MAX_SKB_FRAGS))
+ if (unlikely(i >= MAX_SKB_FRAGS)) {
+ if (!xs->skb)
+ xsk_drop_untrans_skb(skb);
return ERR_PTR(-EOVERFLOW);
+ }
page = pool->umem->pgs[addr >> PAGE_SHIFT];
get_page(page);
--
2.41.3
^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path
2026-04-20 8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (5 preceding siblings ...)
2026-04-20 8:28 ` [PATCH net v2 6/8] xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS Jason Xing
@ 2026-04-20 8:28 ` Jason Xing
2026-04-20 19:58 ` Stanislav Fomichev
2026-04-20 8:28 ` [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
7 siblings, 1 reply; 21+ messages in thread
From: Jason Xing @ 2026-04-20 8:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
When xsk_build_skb() / xsk_build_skb_zerocopy() sees the first
continuation descriptor, it promotes destructor_arg from an inlined
address to a freshly allocated xsk_addrs (num_descs = 1). The counter
is bumped to >= 2 only at the very end of a successful build (by calling
xsk_inc_num_desc()).
If the build fails in between (e.g. alloc_page() returns NULL with
-EAGAIN, or the MAX_SKB_FRAGS overflow hits), we jump to free_err, skip
calling xsk_inc_num_desc() to increment num_descs and leave the half-built
skb attached to xs->skb for the app to retry. The skb now has
1) destructor_arg = a real xsk_addrs pointer,
2) num_descs = 1
If the app never retries and just close()s the socket, xsk_release()
calls xsk_drop_skb() -> xsk_consume_skb(), which decides whether to
free xsk_addrs by testing num_descs > 1:
if (unlikely(num_descs > 1))
kmem_cache_free(xsk_tx_generic_cache, destructor_arg);
Because num_descs is exactly 1 the branch is skipped and the
xsk_addrs object is leaked to the xsk_tx_generic_cache slab.
Fix it by directly testing whether destructor_arg still holds an
inlined address. If it does not, it has been repointed at memory newly
allocated from xsk_tx_generic_cache, which must be freed regardless of
whether num_descs was ever incremented.
Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 9236ec32b54a..6b17974ca825 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -605,7 +605,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
spin_lock_irqsave(&pool->cq_prod_lock, flags);
idx = xskq_get_prod(pool->cq);
- if (unlikely(num_descs > 1)) {
+ if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
for (i = 0; i < num_descs; i++) {
@@ -660,7 +660,7 @@ static void xsk_consume_skb(struct sk_buff *skb)
u32 num_descs = xsk_get_num_desc(skb);
struct xsk_addrs *xsk_addr;
- if (unlikely(num_descs > 1)) {
+ if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
}
--
2.41.3
^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path
2026-04-20 8:28 ` [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
@ 2026-04-20 19:58 ` Stanislav Fomichev
0 siblings, 0 replies; 21+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:58 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, bpf, netdev, Jason Xing
On 04/20, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> When xsk_build_skb() / xsk_build_skb_zerocopy() sees the first
> continuation descriptor, it promotes destructor_arg from an inlined
> address to a freshly allocated xsk_addrs (num_descs = 1). The counter
> is bumped to >= 2 only at the very end of a successful build (by calling
> xsk_inc_num_desc()).
>
> If the build fails in between (e.g. alloc_page() returns NULL with
> -EAGAIN, or the MAX_SKB_FRAGS overflow hits), we jump to free_err, skip
> calling xsk_inc_num_desc() to increment num_descs and leave the half-built
> skb attached to xs->skb for the app to retry. The skb now has
> 1) destructor_arg = a real xsk_addrs pointer,
> 2) num_descs = 1
>
> If the app never retries and just close()s the socket, xsk_release()
> calls xsk_drop_skb() -> xsk_consume_skb(), which decides whether to
> free xsk_addrs by testing num_descs > 1:
>
> if (unlikely(num_descs > 1))
> kmem_cache_free(xsk_tx_generic_cache, destructor_arg);
>
> Because num_descs is exactly 1 the branch is skipped and the
> xsk_addrs object is leaked to the xsk_tx_generic_cache slab.
>
> Fix it by directly testing whether destructor_arg still holds an
> inlined address. If it does not, it has been repointed at memory newly
> allocated from xsk_tx_generic_cache, which must be freed regardless of
> whether num_descs was ever incremented.
>
> Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
> Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> net/xdp/xsk.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 9236ec32b54a..6b17974ca825 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -605,7 +605,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
> spin_lock_irqsave(&pool->cq_prod_lock, flags);
> idx = xskq_get_prod(pool->cq);
>
> - if (unlikely(num_descs > 1)) {
> + if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
> xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
>
> for (i = 0; i < num_descs; i++) {
> @@ -660,7 +660,7 @@ static void xsk_consume_skb(struct sk_buff *skb)
> u32 num_descs = xsk_get_num_desc(skb);
> struct xsk_addrs *xsk_addr;
>
> - if (unlikely(num_descs > 1)) {
> + if (unlikely(!xsk_skb_destructor_is_addr(skb))) {
> xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
> }
> --
> 2.41.3
>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-04-20 8:27 [PATCH net v2 0/8] xsk: fix bugs around xsk skb allocation Jason Xing
` (6 preceding siblings ...)
2026-04-20 8:28 ` [PATCH net v2 7/8] xsk: fix xsk_addrs slab leak on multi-buffer error path Jason Xing
@ 2026-04-20 8:28 ` Jason Xing
2026-04-20 19:49 ` Stanislav Fomichev
7 siblings, 1 reply; 21+ messages in thread
From: Jason Xing @ 2026-04-20 8:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
uintptr_t cast:
skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
mode the chunk offset is encoded in bits 48-63 of the descriptor
address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
lost entirely. The completion queue then returns a truncated address to
userspace, making buffer recycling impossible.
Fix this by handling the 32-bit case directly in
xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an xsk_addrs
struct (the same path already used for multi-descriptor SKBs) to store
the full u64 address.
Extend xsk_drop_untrans_skb() to free the xsk_addrs allocation on 32-bit
when the skb is dropped before transmission. Note that the 0x1UL tag
cannot serve as the test here; on 32-bit a NULL check on destructor_arg
is used instead.
Also extend xsk_skb_destructor_is_addr() to cover the 32-bit case in the
same way.
The overhead is one extra kmem_cache_zalloc per first descriptor on
32-bit only; 64-bit builds are completely unchanged.
Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/
Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 47 insertions(+), 7 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6b17974ca825..bd49dbd9875b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -556,9 +556,23 @@ static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool)
return ret;
}
+/*
+ * On 64-bit, destructor_arg can store an inline address directly
+ * (tagged with bit 0 set). On 32-bit, all addresses go through an
+ * allocated xsk_addrs struct instead. In that case this function
+ * returns true only when destructor_arg is NULL (set_addr has not
+ * yet been called or has failed).
+ *
+ * For all callers:
+ * return true: no xsk_addrs struct to handle
+ * return false: destructor_arg points to an xsk_addrs struct
+ */
static bool xsk_skb_destructor_is_addr(struct sk_buff *skb)
{
- return (uintptr_t)skb_shinfo(skb)->destructor_arg & 0x1UL;
+ if (IS_ENABLED(CONFIG_64BIT))
+ return (uintptr_t)skb_shinfo(skb)->destructor_arg & 0x1UL;
+ else
+ return !skb_shinfo(skb)->destructor_arg;
}
static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
@@ -566,9 +580,21 @@ static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
return (u64)((uintptr_t)skb_shinfo(skb)->destructor_arg & ~0x1UL);
}
-static void xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
+static int xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
{
+ if (!IS_ENABLED(CONFIG_64BIT)) {
+ struct xsk_addrs *xsk_addr;
+
+ xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
+ if (!xsk_addr)
+ return -ENOMEM;
+ xsk_addr->addrs[0] = addr;
+ skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
+ return 0;
+ }
+
skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
+ return 0;
}
static void xsk_inc_num_desc(struct sk_buff *skb)
@@ -644,14 +670,14 @@ void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
}
-static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
- u64 addr)
+static int xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
+ u64 addr)
{
skb->dev = xs->dev;
skb->priority = READ_ONCE(xs->sk.sk_priority);
skb->mark = READ_ONCE(xs->sk.sk_mark);
skb->destructor = xsk_destruct_skb;
- xsk_skb_destructor_set_addr(skb, addr);
+ return xsk_skb_destructor_set_addr(skb, addr);
}
static void xsk_consume_skb(struct sk_buff *skb)
@@ -719,6 +745,12 @@ static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
static void xsk_drop_untrans_skb(struct sk_buff *skb)
{
+ if (!IS_ENABLED(CONFIG_64BIT) && !xsk_skb_destructor_is_addr(skb)) {
+ struct xsk_addrs *xsk_addr;
+
+ xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+ kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
+ }
skb->destructor = sock_wfree;
kfree_skb(skb);
}
@@ -746,7 +778,12 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
skb_reserve(skb, hr);
- xsk_skb_init_misc(skb, xs, desc->addr);
+ err = xsk_skb_init_misc(skb, xs, desc->addr);
+ if (unlikely(err)) {
+ xsk_drop_untrans_skb(skb);
+ return ERR_PTR(err);
+ }
+
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc, pool, hr);
if (unlikely(err)) {
@@ -845,7 +882,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
if (unlikely(err))
goto free_err;
- xsk_skb_init_misc(skb, xs, desc->addr);
+ err = xsk_skb_init_misc(skb, xs, desc->addr);
+ if (unlikely(err))
+ goto free_err;
+
if (desc->options & XDP_TX_METADATA) {
err = xsk_skb_metadata(skb, buffer, desc,
xs->pool, hr);
--
2.41.3
^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-04-20 8:28 ` [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures Jason Xing
@ 2026-04-20 19:49 ` Stanislav Fomichev
2026-04-21 0:49 ` Jason Xing
0 siblings, 1 reply; 21+ messages in thread
From: Stanislav Fomichev @ 2026-04-20 19:49 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, bpf, netdev, Jason Xing
On 04/20, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
> descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
> uintptr_t cast:
>
> skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
>
> On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
> the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
> mode the chunk offset is encoded in bits 48-63 of the descriptor
> address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
> lost entirely. The completion queue then returns a truncated address to
> userspace, making buffer recycling impossible.
>
> Fix this by handling the 32-bit case directly in
> xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an xsk_addrs
> struct (the same path already used for multi-descriptor SKBs) to store
> the full u64 address.
Is it easier to make XSK `depends on 64BIT` to avoid dealing with that? Does
anybody seriously run af_xdp on 32 bit systems?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-04-20 19:49 ` Stanislav Fomichev
@ 2026-04-21 0:49 ` Jason Xing
2026-04-21 22:23 ` Stanislav Fomichev
0 siblings, 1 reply; 21+ messages in thread
From: Jason Xing @ 2026-04-21 0:49 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, bpf, netdev, Jason Xing
On Tue, Apr 21, 2026 at 3:49 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> On 04/20, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
> > descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
> > uintptr_t cast:
> >
> > skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
> >
> > On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
> > the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
> > mode the chunk offset is encoded in bits 48-63 of the descriptor
> > address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
> > lost entirely. The completion queue then returns a truncated address to
> > userspace, making buffer recycling impossible.
> >
> > Fix this by handling the 32-bit case directly in
> > xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an xsk_addrs
> > struct (the same path already used for multi-descriptor SKBs) to store
> > the full u64 address.
>
> Is it easier to make XSK `depends on 64BIT` to avoid dealing with that? Does
Of course, it would be super easy. Actually the initial version looked
like this; one line of code is simply enough.
> anybody seriously run af_xdp on 32 bit systems?
But my worry, as you guessed, is whether any 32-bit system out there
actually uses this. I doubt it. Still, I put some effort into adding
the compatibility code to cover the case. The good news is that it has
no side effects on 64-bit systems since the new paths are guarded by
the IS_ENABLED() condition.
Thanks,
Jason
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net v2 8/8] xsk: fix u64 descriptor address truncation on 32-bit architectures
2026-04-21 0:49 ` Jason Xing
@ 2026-04-21 22:23 ` Stanislav Fomichev
0 siblings, 0 replies; 21+ messages in thread
From: Stanislav Fomichev @ 2026-04-21 22:23 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, bpf, netdev, Jason Xing
On 04/21, Jason Xing wrote:
> On Tue, Apr 21, 2026 at 3:49 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
> >
> > On 04/20, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
> > > descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
> > > uintptr_t cast:
> > >
> > > skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
> > >
> > > On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
> > > the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
> > > mode the chunk offset is encoded in bits 48-63 of the descriptor
> > > address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
> > > lost entirely. The completion queue then returns a truncated address to
> > > userspace, making buffer recycling impossible.
> > >
> > > Fix this by handling the 32-bit case directly in
> > > xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an xsk_addrs
> > > struct (the same path already used for multi-descriptor SKBs) to store
> > > the full u64 address.
> >
> > Is it easier to make XSK `depends on 64BIT` to avoid dealing with that? Does
>
> Of course, it would be super easy. Actually the initial version looked
> like this; one line of code is simply enough.
>
> > anybody seriously run af_xdp on 32 bit systems?
>
> But my worry, as you guessed, is whether any 32-bit system out there
> actually uses this. I doubt it. Still, I put some effort into adding
> the compatibility code to cover the case. The good news is that it has
> no side effects on 64-bit systems since the new paths are guarded by
> the IS_ENABLED() condition.
If someone complains, we can follow up? af_xdp is all u64 everywhere,
including the uapi; I doubt anyone is using it on 32-bit systems. We
don't test this part on 32-bit systems on nipa either...
^ permalink raw reply [flat|nested] 21+ messages in thread