* [PATCH net v4 0/2] xsk: Fixes for AF_XDP fragment handling
@ 2026-02-17 21:08 Nikhil P. Rao
2026-02-17 21:08 ` [PATCH net v4 1/2] xsk: Fix fragment node deletion to prevent buffer leak Nikhil P. Rao
2026-02-17 21:08 ` [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop Nikhil P. Rao
0 siblings, 2 replies; 6+ messages in thread
From: Nikhil P. Rao @ 2026-02-17 21:08 UTC (permalink / raw)
To: netdev
Cc: nikhil.rao, magnus.karlsson, maciej.fijalkowski, sdf, davem,
edumazet, kuba, pabeni, horms, kerneljasonxing
This series fixes two issues in AF_XDP zero-copy fragment handling:
Patch 1 fixes a buffer leak caused by incorrect list node handling after
commit b692bf9a7543. The list_node field is now reused for both the xskb
pool list and the buffer free list. Using list_del() instead of
list_del_init() causes list_empty() checks in xp_free() to fail, preventing
buffers from being added to the free list.
Patch 2 fixes partial packet delivery to userspace. In the zero-copy path,
if the Rx queue fills up while enqueuing fragments, the remaining fragments
are dropped, causing the application to receive incomplete packets. The fix
ensures the Rx queue has sufficient space for all fragments before starting
to enqueue them.
v4 changes:
- Patch 1: Carried Acked-by tags from v2 on patch 1
- Patch 2:
* Fix uninitialized err when sufficient space for all
fragments is not available [2]
v3 changes:
- Patch 1: Carried Acked-by tags from v2 on patch 1
- Patch 2:
* Check for free space only in the multi-buffer case; this preserves
single-buffer performance (Maciej)
* Fix return without freeing buffer when sufficient space for all
fragments is not available
v2 changes:
- Fix indentation issue reported by kernel test robot [1]
[1] https://lore.kernel.org/oe-kbuild-all/202602051720.YfZO23pZ-lkp@intel.com/
[2] https://lore.kernel.org/oe-kbuild-all/202602172046.vf9DtpdF-lkp@intel.com/
Nikhil P. Rao (2):
xsk: Fix fragment node deletion to prevent buffer leak
xsk: Fix zero-copy AF_XDP fragment drop
include/net/xdp_sock_drv.h | 6 +++---
net/xdp/xsk.c | 25 ++++++++++++++++---------
2 files changed, 19 insertions(+), 12 deletions(-)
--
2.43.0
* [PATCH net v4 1/2] xsk: Fix fragment node deletion to prevent buffer leak
2026-02-17 21:08 [PATCH net v4 0/2] xsk: Fixes for AF_XDP fragment handling Nikhil P. Rao
@ 2026-02-17 21:08 ` Nikhil P. Rao
2026-02-17 21:08 ` [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop Nikhil P. Rao
1 sibling, 0 replies; 6+ messages in thread
From: Nikhil P. Rao @ 2026-02-17 21:08 UTC (permalink / raw)
To: netdev
Cc: nikhil.rao, magnus.karlsson, maciej.fijalkowski, sdf, davem,
edumazet, kuba, pabeni, horms, kerneljasonxing
After commit b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"),
the list_node field is reused for both the xskb pool list and the buffer
free list. This causes a buffer leak as described below.
xp_free() checks if a buffer is already on the free list using
list_empty(&xskb->list_node). When list_del() is used to remove a node
from the xskb pool list, it doesn't reinitialize the node pointers.
This means list_empty() will return false even after the node has been
removed, causing xp_free() to incorrectly skip adding the buffer to the
free list.
Fix this by using list_del_init() instead of list_del() in all fragment
handling paths. This ensures the list node is reinitialized after removal,
allowing list_empty() to work correctly.
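The failure mode can be illustrated with a minimal userspace model of the
kernel's circular doubly linked list. This is a simplified sketch whose names
mirror include/linux/list.h, not the kernel code itself (the real list_del()
additionally poisons the pointers):

```c
#include <assert.h>

/* Minimal userspace model of the kernel's circular doubly linked list,
 * showing why list_empty() misbehaves after list_del(). */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* list_del() unlinks the node from its list but leaves the node's own
 * pointers stale, so the node does NOT look "empty" afterwards. */
static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* list_del_init() additionally re-points the node at itself, so a later
 * list_empty(&node) check correctly reports "not on any list". */
static void list_del_init(struct list_head *n)
{
	list_del(n);
	INIT_LIST_HEAD(n);
}

/* Applied to a node, "empty" means "standalone, not linked anywhere". */
static int list_empty(const struct list_head *h)
{
	return h->next == h;
}
```

With list_del(), a removed node still fails the list_empty() test, which is
exactly the condition xp_free() relies on to decide whether the buffer may be
added to the free list.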
Fixes: b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node")
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
---
include/net/xdp_sock_drv.h | 6 +++---
net/xdp/xsk.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 242e34f771cc..aefc368449d5 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -122,7 +122,7 @@ static inline void xsk_buff_free(struct xdp_buff *xdp)
goto out;
list_for_each_entry_safe(pos, tmp, xskb_list, list_node) {
- list_del(&pos->list_node);
+ list_del_init(&pos->list_node);
xp_free(pos);
}
@@ -157,7 +157,7 @@ static inline struct xdp_buff *xsk_buff_get_frag(const struct xdp_buff *first)
frag = list_first_entry_or_null(&xskb->pool->xskb_list,
struct xdp_buff_xsk, list_node);
if (frag) {
- list_del(&frag->list_node);
+ list_del_init(&frag->list_node);
ret = &frag->xdp;
}
@@ -168,7 +168,7 @@ static inline void xsk_buff_del_frag(struct xdp_buff *xdp)
{
struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
- list_del(&xskb->list_node);
+ list_del_init(&xskb->list_node);
}
static inline struct xdp_buff *xsk_buff_get_head(struct xdp_buff *first)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f093c3453f64..f2ec4f78bbb6 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -186,7 +186,7 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
err = __xsk_rcv_zc(xs, pos, len, contd);
if (err)
goto err;
- list_del(&pos->list_node);
+ list_del_init(&pos->list_node);
}
return 0;
--
2.43.0
* [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop
2026-02-17 21:08 [PATCH net v4 0/2] xsk: Fixes for AF_XDP fragment handling Nikhil P. Rao
2026-02-17 21:08 ` [PATCH net v4 1/2] xsk: Fix fragment node deletion to prevent buffer leak Nikhil P. Rao
@ 2026-02-17 21:08 ` Nikhil P. Rao
2026-02-19 22:55 ` Jakub Kicinski
1 sibling, 1 reply; 6+ messages in thread
From: Nikhil P. Rao @ 2026-02-17 21:08 UTC (permalink / raw)
To: netdev
Cc: nikhil.rao, magnus.karlsson, maciej.fijalkowski, sdf, davem,
edumazet, kuba, pabeni, horms, kerneljasonxing
AF_XDP should ensure that only complete packets are delivered to the
application.
In the zero-copy case, if the Rx queue gets full as fragments are being
enqueued, the remaining fragments are dropped.
For the multi-buffer case, add a check to ensure that the Rx queue has
enough space for all fragments of a packet before starting to enqueue
them.
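The all-or-nothing admission check can be sketched with a toy
single-producer ring. The names here (nb_free(), produce_pkt()) only loosely
mirror xskq_prod_nb_free() and the xsk queue code; this is an illustration
under simplified assumptions, not the actual implementation:

```c
#include <assert.h>

/* Toy single-producer ring modeling the patch's admission check: only
 * start producing a multi-fragment packet once the ring is known to have
 * room for every fragment, so no packet is ever enqueued partially. */
#define RING_SIZE 4

struct ring {
	unsigned int prod, cons; /* free-running counters; fill is prod - cons */
};

static unsigned int nb_free(const struct ring *r)
{
	return RING_SIZE - (r->prod - r->cons);
}

/* Returns 0 on success, -1 (ENOBUFS-like) if the whole packet won't fit. */
static int produce_pkt(struct ring *r, unsigned int num_desc)
{
	if (nb_free(r) < num_desc)
		return -1;       /* drop the whole packet, not just its tail */
	r->prod += num_desc;     /* individual reservations can no longer fail */
	return 0;
}
```

Checking the free count once up front is what lets the per-fragment
reservations in the loop become infallible, which is why the patch can drop
the per-fragment error handling.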
Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
---
net/xdp/xsk.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f2ec4f78bbb6..f7f816a5cb80 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -167,25 +167,32 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
struct xdp_buff_xsk *pos, *tmp;
struct list_head *xskb_list;
u32 contd = 0;
+ u32 num_desc;
int err;
- if (frags)
+ if (frags) {
+ num_desc = xdp_get_shared_info_from_buff(xdp)->nr_frags + 1;
contd = XDP_PKT_CONTD;
+ } else {
+ err = __xsk_rcv_zc(xs, xskb, len, contd);
+ if (err)
+ goto err;
+ return 0;
+ }
- err = __xsk_rcv_zc(xs, xskb, len, contd);
- if (err)
+ if (xskq_prod_nb_free(xs->rx, num_desc) < num_desc) {
+ xs->rx_queue_full++;
+ err = -ENOBUFS;
goto err;
- if (likely(!frags))
- return 0;
+ }
+ __xsk_rcv_zc(xs, xskb, len, contd);
xskb_list = &xskb->pool->xskb_list;
list_for_each_entry_safe(pos, tmp, xskb_list, list_node) {
if (list_is_singular(xskb_list))
contd = 0;
len = pos->xdp.data_end - pos->xdp.data;
- err = __xsk_rcv_zc(xs, pos, len, contd);
- if (err)
- goto err;
+ __xsk_rcv_zc(xs, pos, len, contd);
list_del_init(&pos->list_node);
}
--
2.43.0
* Re: [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop
2026-02-17 21:08 ` [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop Nikhil P. Rao
@ 2026-02-19 22:55 ` Jakub Kicinski
2026-02-20 12:37 ` Maciej Fijalkowski
0 siblings, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2026-02-19 22:55 UTC (permalink / raw)
To: Nikhil P. Rao
Cc: netdev, magnus.karlsson, maciej.fijalkowski, sdf, davem, edumazet,
pabeni, horms, kerneljasonxing
On Tue, 17 Feb 2026 21:08:51 +0000 Nikhil P. Rao wrote:
> AF_XDP should ensure that only a complete packet is sent to application.
> In the zero-copy case, if the Rx queue gets full as fragments are being
> enqueued, the remaining fragments are dropped.
>
> For the multi-buffer case, add a check to ensure that the Rx queue has
> enough space for all fragments of a packet before starting to enqueue
> them.
>
> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
> ---
> net/xdp/xsk.c | 23 +++++++++++++++--------
> 1 file changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index f2ec4f78bbb6..f7f816a5cb80 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -167,25 +167,32 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
> struct xdp_buff_xsk *pos, *tmp;
> struct list_head *xskb_list;
> u32 contd = 0;
> + u32 num_desc;
> int err;
>
> - if (frags)
> + if (frags) {
> + num_desc = xdp_get_shared_info_from_buff(xdp)->nr_frags + 1;
> contd = XDP_PKT_CONTD;
[1]
> + } else {
> + err = __xsk_rcv_zc(xs, xskb, len, contd);
> + if (err)
> + goto err;
> + return 0;
> + }
>
> - err = __xsk_rcv_zc(xs, xskb, len, contd);
> - if (err)
> + if (xskq_prod_nb_free(xs->rx, num_desc) < num_desc) {
We can pull this check into the branch at [1]
It will let us preserve the existing flow.
Either that or handle the non-frag case fully upfront:
if (likely(!frags)) {
err = __xsk_rcv_zc(xs, xskb, len, 0);
if (err)
goto err;
return 0;
}
As is you have a weird mix of the two.
> + xs->rx_queue_full++;
> + err = -ENOBUFS;
> goto err;
> - if (likely(!frags))
> - return 0;
> + }
>
> + __xsk_rcv_zc(xs, xskb, len, contd);
Personal preference perhaps but removing error checking always
gives me pause. Maybe:
bool frag_fail;
frag_fail = __xsk_rcv_zc(xs, xskb, len, contd);
list_for_each...
...
frag_fail |= __xsk_rcv_zc(xs, xskb, len, contd);
DEBUG_NET_WARN_ON_ONCE(frag_fail);
?
> xskb_list = &xskb->pool->xskb_list;
> list_for_each_entry_safe(pos, tmp, xskb_list, list_node) {
> if (list_is_singular(xskb_list))
> contd = 0;
> len = pos->xdp.data_end - pos->xdp.data;
> - err = __xsk_rcv_zc(xs, pos, len, contd);
> - if (err)
> - goto err;
> + __xsk_rcv_zc(xs, pos, len, contd);
> list_del_init(&pos->list_node);
> }
>
--
pw-bot: cr
* Re: [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop
2026-02-19 22:55 ` Jakub Kicinski
@ 2026-02-20 12:37 ` Maciej Fijalkowski
2026-02-20 20:38 ` Jakub Kicinski
0 siblings, 1 reply; 6+ messages in thread
From: Maciej Fijalkowski @ 2026-02-20 12:37 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Nikhil P. Rao, netdev, magnus.karlsson, sdf, davem, edumazet,
pabeni, horms, kerneljasonxing
On Thu, Feb 19, 2026 at 02:55:29PM -0800, Jakub Kicinski wrote:
> On Tue, 17 Feb 2026 21:08:51 +0000 Nikhil P. Rao wrote:
> > AF_XDP should ensure that only a complete packet is sent to application.
> > In the zero-copy case, if the Rx queue gets full as fragments are being
> > enqueued, the remaining fragments are dropped.
> >
> > For the multi-buffer case, add a check to ensure that the Rx queue has
> > enough space for all fragments of a packet before starting to enqueue
> > them.
> >
> > Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> > Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
> > ---
> > net/xdp/xsk.c | 23 +++++++++++++++--------
> > 1 file changed, 15 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index f2ec4f78bbb6..f7f816a5cb80 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -167,25 +167,32 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
> > struct xdp_buff_xsk *pos, *tmp;
> > struct list_head *xskb_list;
> > u32 contd = 0;
> > + u32 num_desc;
> > int err;
> >
> > - if (frags)
> > + if (frags) {
> > + num_desc = xdp_get_shared_info_from_buff(xdp)->nr_frags + 1;
> > contd = XDP_PKT_CONTD;
>
> [1]
>
> > + } else {
> > + err = __xsk_rcv_zc(xs, xskb, len, contd);
> > + if (err)
> > + goto err;
> > + return 0;
> > + }
> >
> > - err = __xsk_rcv_zc(xs, xskb, len, contd);
> > - if (err)
> > + if (xskq_prod_nb_free(xs->rx, num_desc) < num_desc) {
>
> We can pull this check into the branch at [1]
> It will let us preserve the existing flow.
Hi Jakub,
that would work, yes.
>
> Either that or handle the non-frag case fully upfront:
>
> if (likely(!frags)) {
> err = __xsk_rcv_zc(xs, xskb, len, 0);
> if (err)
> goto err;
> return 0;
> }
>
> As is you have a weird mix of the two.
>
> > + xs->rx_queue_full++;
> > + err = -ENOBUFS;
> > goto err;
> > - if (likely(!frags))
> > - return 0;
> > + }
> >
> > + __xsk_rcv_zc(xs, xskb, len, contd);
>
> Personal preference perhaps but removing error checking always
> gives me pause. Maybe:
>
> bool frag_fail;
>
> frag_fail = __xsk_rcv_zc(xs, xskb, len, contd);
> list_for_each...
> ...
> frag_fail |= __xsk_rcv_zc(xs, xskb, len, contd);
> DEBUG_NET_WARN_ON_ONCE(frag_fail);
error checking can actually be skipped, as xskq_prod_nb_free() peeked into
the xsk rx queue and told us there is enough space for descriptor production.
I have sent a patch that adds a variant of __xsk_rcv_zc() that skips
xskq_prod_reserve_desc():
https://lore.kernel.org/bpf/20260218150000.301176-1-maciej.fijalkowski@intel.com/
The logistics of these patches (this set & the patch linked above) are a bit
of a question to me though, since what Nikhil sent are clearly fixes that
need backports, whereas mine was sent as an improvement targeting the -next
tree. However, the path that Nikhil touched here should be adjusted to what
my patch introduces. I might do this as a follow-up once bpf is merged to
bpf-next.
Nikhil, I also see you routed the set to 'net' tree, previously xsk core
was handled via bpf/bpf-next.
>
> ?
>
> > xskb_list = &xskb->pool->xskb_list;
> > list_for_each_entry_safe(pos, tmp, xskb_list, list_node) {
> > if (list_is_singular(xskb_list))
> > contd = 0;
> > len = pos->xdp.data_end - pos->xdp.data;
> > - err = __xsk_rcv_zc(xs, pos, len, contd);
> > - if (err)
> > - goto err;
> > + __xsk_rcv_zc(xs, pos, len, contd);
> > list_del_init(&pos->list_node);
> > }
> >
> --
> pw-bot: cr
* Re: [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop
2026-02-20 12:37 ` Maciej Fijalkowski
@ 2026-02-20 20:38 ` Jakub Kicinski
0 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2026-02-20 20:38 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Nikhil P. Rao, netdev, magnus.karlsson, sdf, davem, edumazet,
pabeni, horms, kerneljasonxing
On Fri, 20 Feb 2026 13:37:09 +0100 Maciej Fijalkowski wrote:
> > Personal preference perhaps but removing error checking always
> > gives me pause. Maybe:
> >
> > bool frag_fail;
> >
> > frag_fail = __xsk_rcv_zc(xs, xskb, len, contd);
> > list_for_each...
> > ...
> > frag_fail |= __xsk_rcv_zc(xs, xskb, len, contd);
> > DEBUG_NET_WARN_ON_ONCE(frag_fail);
>
> error checking can be actually skipped as xskq_prod_nb_free() peeked into
> xsk rx queue and told us there is enough space for descriptor production.
Understood. I was wondering whether the assert / DEBUG_NET.. may still
be worth keeping but up to you.
> Nikhil, I also see you routed the set to 'net' tree, previously xsk core
> was handled via bpf/bpf-next.
Dunno... We had mixed results, I think net / net-next is fine for stuff
that's purely packet movement. There is no BPF dependency here..
Unless you have a reason! I'm not feeling strongly. Just the stuff we
were fixing recently made me wonder if this code is getting sufficient
networking eyes..