* [PATCH net v4 0/2] xsk: Fixes for AF_XDP fragment handling
@ 2026-02-17 21:08 Nikhil P. Rao
2026-02-17 21:08 ` [PATCH net v4 1/2] xsk: Fix fragment node deletion to prevent buffer leak Nikhil P. Rao
2026-02-17 21:08 ` [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop Nikhil P. Rao
0 siblings, 2 replies; 6+ messages in thread
From: Nikhil P. Rao @ 2026-02-17 21:08 UTC
To: netdev
Cc: nikhil.rao, magnus.karlsson, maciej.fijalkowski, sdf, davem,
edumazet, kuba, pabeni, horms, kerneljasonxing
This series fixes two issues in AF_XDP zero-copy fragment handling:
Patch 1 fixes a buffer leak caused by incorrect list node handling after
commit b692bf9a7543. The list_node field is now reused for both the xskb
pool list and the buffer free list. Using list_del() instead of
list_del_init() causes list_empty() checks in xp_free() to fail, preventing
buffers from being added to the free list.
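The effect of the two removal helpers can be reproduced outside the kernel. The following is a minimal userspace sketch, not the kernel code: struct list_head and the list helpers are simplified stand-ins for the ones in include/linux/list.h, and model_free() and buffer_reaches_free_list() are hypothetical names that only mimic the list_empty() guard described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct list_head (illustration only). */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink only: the node's own pointers are left stale here (the kernel
 * poisons them); either way they no longer point back at the node itself.
 */
static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Unlink and re-initialize, so list_empty(node) is true afterwards. */
static void list_del_init(struct list_head *n)
{
	list_del(n);
	init_list_head(n);
}

/* Hypothetical model of the guard in xp_free(): a buffer is queued on the
 * free list only if its node does not look linked anywhere.
 */
static bool model_free(struct list_head *node, struct list_head *free_list)
{
	if (!list_empty(node))
		return false;	/* treated as already queued: buffer is leaked */
	list_add_tail(node, free_list);
	return true;
}

/* Returns true if a buffer removed from the pool list reaches the free list. */
bool buffer_reaches_free_list(bool use_del_init)
{
	struct list_head pool, free_list, buf;

	init_list_head(&pool);
	init_list_head(&free_list);
	init_list_head(&buf);
	list_add_tail(&buf, &pool);

	if (use_del_init)
		list_del_init(&buf);
	else
		list_del(&buf);

	return model_free(&buf, &free_list);
}
```

In this model buffer_reaches_free_list(false) reports that the buffer was skipped, while buffer_reaches_free_list(true) reports that it was queued, which is the behaviour difference patch 1 relies on.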
Patch 2 fixes partial packet delivery to userspace. In the zero-copy path,
if the Rx queue fills up while enqueuing fragments, the remaining fragments
are dropped, causing the application to receive incomplete packets. The fix
ensures the Rx queue has sufficient space for all fragments before starting
to enqueue them.
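The check-before-enqueue idea can be sketched on a toy ring. This is an illustration under stated assumptions, not the code in net/xdp/xsk.c: struct ring, ring_free(), ring_push() and ring_push_packet() are hypothetical names, and the ring is a simplified single-producer model rather than the kernel's xsk queue.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4u	/* hypothetical ring size; must be a power of two */

/* Hypothetical single-producer descriptor ring, loosely modelled on an
 * AF_XDP Rx ring. It is not the kernel's struct xsk_queue.
 */
struct ring {
	uint32_t prod;			/* free-running producer index */
	uint32_t cons;			/* free-running consumer index */
	uint32_t len[RING_SIZE];	/* descriptor payload: fragment length */
	bool more[RING_SIZE];		/* set when another fragment follows */
	uint32_t queue_full;		/* packets rejected for lack of space */
};

static uint32_t ring_free(const struct ring *r)
{
	return RING_SIZE - (r->prod - r->cons);
}

/* Enqueue one descriptor; fails if the ring is full. */
static int ring_push(struct ring *r, uint32_t len, bool more)
{
	if (ring_free(r) == 0)
		return -ENOBUFS;
	r->len[r->prod & (RING_SIZE - 1)] = len;
	r->more[r->prod & (RING_SIZE - 1)] = more;
	r->prod++;
	return 0;
}

/* All-or-nothing enqueue of a packet made of nr fragments.
 * Returns 0 on success, -ENOBUFS if the packet was rejected as a whole.
 */
int ring_push_packet(struct ring *r, const uint32_t *frag_len, size_t nr)
{
	bool frag_fail = false;
	size_t i;

	if (ring_free(r) < nr) {
		r->queue_full++;
		return -ENOBUFS;	/* nothing enqueued: no partial packet */
	}

	for (i = 0; i < nr; i++)
		frag_fail |= ring_push(r, frag_len[i], i + 1 < nr) != 0;

	/* Cannot fail after the space check above; kept as a sanity check. */
	return frag_fail ? -ENOBUFS : 0;
}
```

With a four-slot ring, a three-fragment packet is accepted and a following two-fragment packet is rejected without touching the producer index, so the consumer never sees a packet that is missing its tail.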
v4 changes:
- Patch 1: Carried Acked-by tags from v2 on patch 1
- Patch 2:
  * Fix uninitialized err when sufficient space for all
    fragments is not available [2]
v3 changes:
- Patch 1: Carried Acked-by tags from v2 on patch 1
- Patch 2:
  * Check for free space only for the multi-buffer case; this preserves
    single buffer performance (Maciej)
  * Fix return without freeing buffer when sufficient space for all
    fragments is not available
v2 changes:
- Fix indentation issue reported by kernel test robot [1]
[1] https://lore.kernel.org/oe-kbuild-all/202602051720.YfZO23pZ-lkp@intel.com/
[2] https://lore.kernel.org/oe-kbuild-all/202602172046.vf9DtpdF-lkp@intel.com/
Nikhil P. Rao (2):
xsk: Fix fragment node deletion to prevent buffer leak
xsk: Fix zero-copy AF_XDP fragment drop
include/net/xdp_sock_drv.h | 6 +++---
net/xdp/xsk.c | 25 ++++++++++++++++---------
2 files changed, 19 insertions(+), 12 deletions(-)
--
2.43.0
* [PATCH net v4 1/2] xsk: Fix fragment node deletion to prevent buffer leak
2026-02-17 21:08 [PATCH net v4 0/2] xsk: Fixes for AF_XDP fragment handling Nikhil P. Rao
@ 2026-02-17 21:08 ` Nikhil P. Rao
2026-02-17 21:08 ` [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop Nikhil P. Rao
1 sibling, 0 replies; 6+ messages in thread
From: Nikhil P. Rao @ 2026-02-17 21:08 UTC
To: netdev
Cc: nikhil.rao, magnus.karlsson, maciej.fijalkowski, sdf, davem,
edumazet, kuba, pabeni, horms, kerneljasonxing

After commit b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"),
the list_node field is reused for both the xskb pool list and the buffer
free list. This causes a buffer leak as described below.

xp_free() checks if a buffer is already on the free list using
list_empty(&xskb->list_node). When list_del() is used to remove a node from
the xskb pool list, it doesn't reinitialize the node pointers. This means
list_empty() will return false even after the node has been removed,
causing xp_free() to incorrectly skip adding the buffer to the free list.

Fix this by using list_del_init() instead of list_del() in all fragment
handling paths. This ensures the list node is reinitialized after removal,
allowing the list_empty() check to work correctly.

Fixes: b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node")
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
---
 include/net/xdp_sock_drv.h | 6 +++---
 net/xdp/xsk.c              | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 242e34f771cc..aefc368449d5 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -122,7 +122,7 @@ static inline void xsk_buff_free(struct xdp_buff *xdp)
 		goto out;

 	list_for_each_entry_safe(pos, tmp, xskb_list, list_node) {
-		list_del(&pos->list_node);
+		list_del_init(&pos->list_node);
 		xp_free(pos);
 	}

@@ -157,7 +157,7 @@ static inline struct xdp_buff *xsk_buff_get_frag(const struct xdp_buff *first)
 	frag = list_first_entry_or_null(&xskb->pool->xskb_list,
 					struct xdp_buff_xsk, list_node);
 	if (frag) {
-		list_del(&frag->list_node);
+		list_del_init(&frag->list_node);
 		ret = &frag->xdp;
 	}

@@ -168,7 +168,7 @@ static inline void xsk_buff_del_frag(struct xdp_buff *xdp)
 {
 	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);

-	list_del(&xskb->list_node);
+	list_del_init(&xskb->list_node);
 }

 static inline struct xdp_buff *xsk_buff_get_head(struct xdp_buff *first)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f093c3453f64..f2ec4f78bbb6 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -186,7 +186,7 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 		err = __xsk_rcv_zc(xs, pos, len, contd);
 		if (err)
 			goto err;
-		list_del(&pos->list_node);
+		list_del_init(&pos->list_node);
 	}

 	return 0;
--
2.43.0
* [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop
2026-02-17 21:08 [PATCH net v4 0/2] xsk: Fixes for AF_XDP fragment handling Nikhil P. Rao
2026-02-17 21:08 ` [PATCH net v4 1/2] xsk: Fix fragment node deletion to prevent buffer leak Nikhil P. Rao
@ 2026-02-17 21:08 ` Nikhil P. Rao
2026-02-19 22:55 ` Jakub Kicinski
1 sibling, 1 reply; 6+ messages in thread
From: Nikhil P. Rao @ 2026-02-17 21:08 UTC
To: netdev
Cc: nikhil.rao, magnus.karlsson, maciej.fijalkowski, sdf, davem,
edumazet, kuba, pabeni, horms, kerneljasonxing

AF_XDP should ensure that only a complete packet is sent to application.
In the zero-copy case, if the Rx queue gets full as fragments are being
enqueued, the remaining fragments are dropped.

For the multi-buffer case, add a check to ensure that the Rx queue has
enough space for all fragments of a packet before starting to enqueue
them.

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
---
 net/xdp/xsk.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f2ec4f78bbb6..f7f816a5cb80 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -167,25 +167,32 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 	struct xdp_buff_xsk *pos, *tmp;
 	struct list_head *xskb_list;
 	u32 contd = 0;
+	u32 num_desc;
 	int err;

-	if (frags)
+	if (frags) {
+		num_desc = xdp_get_shared_info_from_buff(xdp)->nr_frags + 1;
 		contd = XDP_PKT_CONTD;
+	} else {
+		err = __xsk_rcv_zc(xs, xskb, len, contd);
+		if (err)
+			goto err;
+		return 0;
+	}

-	err = __xsk_rcv_zc(xs, xskb, len, contd);
-	if (err)
+	if (xskq_prod_nb_free(xs->rx, num_desc) < num_desc) {
+		xs->rx_queue_full++;
+		err = -ENOBUFS;
 		goto err;
-	if (likely(!frags))
-		return 0;
+	}

+	__xsk_rcv_zc(xs, xskb, len, contd);
 	xskb_list = &xskb->pool->xskb_list;
 	list_for_each_entry_safe(pos, tmp, xskb_list, list_node) {
 		if (list_is_singular(xskb_list))
 			contd = 0;
 		len = pos->xdp.data_end - pos->xdp.data;
-		err = __xsk_rcv_zc(xs, pos, len, contd);
-		if (err)
-			goto err;
+		__xsk_rcv_zc(xs, pos, len, contd);
 		list_del_init(&pos->list_node);
 	}

--
2.43.0
* Re: [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop
2026-02-17 21:08 ` [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop Nikhil P. Rao
@ 2026-02-19 22:55 ` Jakub Kicinski
2026-02-20 12:37 ` Maciej Fijalkowski
0 siblings, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2026-02-19 22:55 UTC
To: Nikhil P. Rao
Cc: netdev, magnus.karlsson, maciej.fijalkowski, sdf, davem, edumazet,
pabeni, horms, kerneljasonxing

On Tue, 17 Feb 2026 21:08:51 +0000 Nikhil P. Rao wrote:
> AF_XDP should ensure that only a complete packet is sent to application.
> In the zero-copy case, if the Rx queue gets full as fragments are being
> enqueued, the remaining fragments are dropped.
>
> For the multi-buffer case, add a check to ensure that the Rx queue has
> enough space for all fragments of a packet before starting to enqueue
> them.
>
> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
> ---
>  net/xdp/xsk.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index f2ec4f78bbb6..f7f816a5cb80 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -167,25 +167,32 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>  	struct xdp_buff_xsk *pos, *tmp;
>  	struct list_head *xskb_list;
>  	u32 contd = 0;
> +	u32 num_desc;
>  	int err;
>
> -	if (frags)
> +	if (frags) {
> +		num_desc = xdp_get_shared_info_from_buff(xdp)->nr_frags + 1;
>  		contd = XDP_PKT_CONTD;

[1]

> +	} else {
> +		err = __xsk_rcv_zc(xs, xskb, len, contd);
> +		if (err)
> +			goto err;
> +		return 0;
> +	}
>
> -	err = __xsk_rcv_zc(xs, xskb, len, contd);
> -	if (err)
> +	if (xskq_prod_nb_free(xs->rx, num_desc) < num_desc) {

We can pull this check into the branch at [1]
It will let us preserve the existing flow.

Either that or handle the non-frag case fully upfront:

	if (likely(!frags)) {
		err = __xsk_rcv_zc(xs, xskb, len, 0);
		if (err)
			goto err;
		return 0;
	}

As is you have a weird mix of the two.

> +		xs->rx_queue_full++;
> +		err = -ENOBUFS;
>  		goto err;
> -	if (likely(!frags))
> -		return 0;
> +	}
>
> +	__xsk_rcv_zc(xs, xskb, len, contd);

Personal preference perhaps but removing error checking always
gives me pause. Maybe:

	bool frag_fail;

	frag_fail = __xsk_rcv_zc(xs, xskb, len, contd);
	list_for_each...
		...
		frag_fail |= __xsk_rcv_zc(xs, xskb, len, contd);
	DEBUG_NET_WARN_ON_ONCE(frag_fail);

?

>  	xskb_list = &xskb->pool->xskb_list;
>  	list_for_each_entry_safe(pos, tmp, xskb_list, list_node) {
>  		if (list_is_singular(xskb_list))
>  			contd = 0;
>  		len = pos->xdp.data_end - pos->xdp.data;
> -		err = __xsk_rcv_zc(xs, pos, len, contd);
> -		if (err)
> -			goto err;
> +		__xsk_rcv_zc(xs, pos, len, contd);
>  		list_del_init(&pos->list_node);
>  	}
>
--
pw-bot: cr
* Re: [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop
2026-02-19 22:55 ` Jakub Kicinski
@ 2026-02-20 12:37 ` Maciej Fijalkowski
2026-02-20 20:38 ` Jakub Kicinski
0 siblings, 1 reply; 6+ messages in thread
From: Maciej Fijalkowski @ 2026-02-20 12:37 UTC
To: Jakub Kicinski
Cc: Nikhil P. Rao, netdev, magnus.karlsson, sdf, davem, edumazet,
pabeni, horms, kerneljasonxing

On Thu, Feb 19, 2026 at 02:55:29PM -0800, Jakub Kicinski wrote:
> On Tue, 17 Feb 2026 21:08:51 +0000 Nikhil P. Rao wrote:
> > AF_XDP should ensure that only a complete packet is sent to application.
> > In the zero-copy case, if the Rx queue gets full as fragments are being
> > enqueued, the remaining fragments are dropped.
> >
> > For the multi-buffer case, add a check to ensure that the Rx queue has
> > enough space for all fragments of a packet before starting to enqueue
> > them.
> >
> > Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> > Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
> > ---
> >  net/xdp/xsk.c | 23 +++++++++++++++--------
> >  1 file changed, 15 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index f2ec4f78bbb6..f7f816a5cb80 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -167,25 +167,32 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
> >  	struct xdp_buff_xsk *pos, *tmp;
> >  	struct list_head *xskb_list;
> >  	u32 contd = 0;
> > +	u32 num_desc;
> >  	int err;
> >
> > -	if (frags)
> > +	if (frags) {
> > +		num_desc = xdp_get_shared_info_from_buff(xdp)->nr_frags + 1;
> >  		contd = XDP_PKT_CONTD;
>
> [1]
>
> > +	} else {
> > +		err = __xsk_rcv_zc(xs, xskb, len, contd);
> > +		if (err)
> > +			goto err;
> > +		return 0;
> > +	}
> >
> > -	err = __xsk_rcv_zc(xs, xskb, len, contd);
> > -	if (err)
> > +	if (xskq_prod_nb_free(xs->rx, num_desc) < num_desc) {
>
> We can pull this check into the branch at [1]
> It will let us preserve the existing flow.

Hi Jakub, that would work, yes.

>
> Either that or handle the non-frag case fully upfront:
>
> 	if (likely(!frags)) {
> 		err = __xsk_rcv_zc(xs, xskb, len, 0);
> 		if (err)
> 			goto err;
> 		return 0;
> 	}
>
> As is you have a weird mix of the two.
>
> > +		xs->rx_queue_full++;
> > +		err = -ENOBUFS;
> >  		goto err;
> > -	if (likely(!frags))
> > -		return 0;
> > +	}
> >
> > +	__xsk_rcv_zc(xs, xskb, len, contd);
>
> Personal preference perhaps but removing error checking always
> gives me pause. Maybe:
>
> 	bool frag_fail;
>
> 	frag_fail = __xsk_rcv_zc(xs, xskb, len, contd);
> 	list_for_each...
> 		...
> 		frag_fail |= __xsk_rcv_zc(xs, xskb, len, contd);
> 	DEBUG_NET_WARN_ON_ONCE(frag_fail);

error checking can be actually skipped as xskq_prod_nb_free() peeked into
xsk rx queue and told us there is enough space for descriptor production.

I have sent a patch that adds a variant of __xsk_rcv_zc() that skips
xskq_prod_reserve_desc():

https://lore.kernel.org/bpf/20260218150000.301176-1-maciej.fijalkowski@intel.com/

Logistics of these patches (this set & patch linked above) are a bit of a
question to me though since what Nikhil sent are clearly fixes that need
backports whereas mine was sent as an improvement towards -next tree.
However, path that Nikhil touched here should be adjusted to what my patch
introduces. I might do this as a follow-up once bpf is merged to bpf-next.

Nikhil, I also see you routed the set to 'net' tree, previously xsk core
was handled via bpf/bpf-next.

>
> ?
>
> >  	xskb_list = &xskb->pool->xskb_list;
> >  	list_for_each_entry_safe(pos, tmp, xskb_list, list_node) {
> >  		if (list_is_singular(xskb_list))
> >  			contd = 0;
> >  		len = pos->xdp.data_end - pos->xdp.data;
> > -		err = __xsk_rcv_zc(xs, pos, len, contd);
> > -		if (err)
> > -			goto err;
> > +		__xsk_rcv_zc(xs, pos, len, contd);
> >  		list_del_init(&pos->list_node);
> >  	}
> >
> --
> pw-bot: cr
* Re: [PATCH net v4 2/2] xsk: Fix zero-copy AF_XDP fragment drop
2026-02-20 12:37 ` Maciej Fijalkowski
@ 2026-02-20 20:38 ` Jakub Kicinski
0 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2026-02-20 20:38 UTC
To: Maciej Fijalkowski
Cc: Nikhil P. Rao, netdev, magnus.karlsson, sdf, davem, edumazet,
pabeni, horms, kerneljasonxing

On Fri, 20 Feb 2026 13:37:09 +0100 Maciej Fijalkowski wrote:
> > Personal preference perhaps but removing error checking always
> > gives me pause. Maybe:
> >
> > 	bool frag_fail;
> >
> > 	frag_fail = __xsk_rcv_zc(xs, xskb, len, contd);
> > 	list_for_each...
> > 		...
> > 		frag_fail |= __xsk_rcv_zc(xs, xskb, len, contd);
> > 	DEBUG_NET_WARN_ON_ONCE(frag_fail);
>
> error checking can be actually skipped as xskq_prod_nb_free() peeked into
> xsk rx queue and told us there is enough space for descriptor production.

Understood. I was wondering whether the assert / DEBUG_NET.. may still
be worth keeping but up to you.

> Nikhil, I also see you routed the set to 'net' tree, previously xsk core
> was handled via bpf/bpf-next.

Dunno... We had mixed results, I think net / net-next is fine for stuff
that's purely packet movement. There is no BPF dependency here..
Unless you have a reason! I'm not feeling strongly. Just the stuff
we were fixing recently made me wonder if this code is getting
sufficient networking eyes..