* [PATCH 1/2 bpf] xdp: add XDP extension to skb
@ 2025-10-28 16:01 Fernando Fernandez Mancera
  2025-10-28 16:02 ` [PATCH 2/2 bpf] xsk: avoid data corruption on cq descriptor number Fernando Fernandez Mancera
  0 siblings, 1 reply; 4+ messages in thread
From: Fernando Fernandez Mancera @ 2025-10-28 16:01 UTC (permalink / raw)
  To: bpf
  Cc: netdev, magnus.karlsson, maciej.fijalkowski, sdf, kerneljasonxing,
	fw, Fernando Fernandez Mancera

This patch adds a new skb extension for XDP representing the number of
cq descriptors and a linked list of umem addresses. This is going to be
used from the xsk skb destructor to put the umem addresses onto the
pool's completion queue.

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
Note: CC'ing Florian Westphal in case I am missing something while
implementing/using the skb extension.
---
 include/linux/skbuff.h | 3 +++
 include/net/xdp_sock.h | 5 +++++
 net/core/skbuff.c      | 4 ++++
 net/xdp/Kconfig        | 1 +
 4 files changed, 13 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fb3fec9affaa..1c4a598b6564 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4910,6 +4910,9 @@ enum skb_ext_id {
 #endif
 #if IS_ENABLED(CONFIG_INET_PSP)
 	SKB_EXT_PSP,
+#endif
+#if IS_ENABLED(CONFIG_XDP_SOCKETS)
+	SKB_EXT_XDP,
 #endif
 	SKB_EXT_NUM, /* must be last */
 };
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ce587a225661..94c607093768 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -120,6 +120,11 @@ struct xsk_tx_metadata_ops {
 	void	(*tmo_request_launch_time)(u64 launch_time, void *priv);
 };
 
+struct xdp_skb_ext {
+	u32 num_descs;
+	struct list_head addrs_list;
+};
+
 #ifdef CONFIG_XDP_SOCKETS
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6be01454f262..f3966b8c61ee 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -81,6 +81,7 @@
 #include <net/page_pool/helpers.h>
 #include <net/psp/types.h>
 #include <net/dropreason.h>
+#include <net/xdp_sock.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -5066,6 +5067,9 @@ static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_INET_PSP)
 	[SKB_EXT_PSP] = SKB_EXT_CHUNKSIZEOF(struct psp_skb_ext),
 #endif
+#if IS_ENABLED(CONFIG_XDP_SOCKETS)
+	[SKB_EXT_XDP] = SKB_EXT_CHUNKSIZEOF(struct xdp_skb_ext),
+#endif
 };
 
 static __always_inline unsigned int skb_ext_total_length(void)
diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
index 71af2febe72a..89546c48ac2a 100644
--- a/net/xdp/Kconfig
+++ b/net/xdp/Kconfig
@@ -2,6 +2,7 @@
 config XDP_SOCKETS
 	bool "XDP sockets"
 	depends on BPF_SYSCALL
+	select SKB_EXTENSIONS
 	default n
 	help
 	  XDP sockets allows a channel between XDP programs and
-- 
2.51.0
* [PATCH 2/2 bpf] xsk: avoid data corruption on cq descriptor number
  2025-10-28 16:01 [PATCH 1/2 bpf] xdp: add XDP extension to skb Fernando Fernandez Mancera
@ 2025-10-28 16:02 ` Fernando Fernandez Mancera
  2025-10-28 16:34   ` Fernando Fernandez Mancera
  2025-10-28 16:45   ` bot+bpf-ci
  0 siblings, 2 replies; 4+ messages in thread
From: Fernando Fernandez Mancera @ 2025-10-28 16:02 UTC (permalink / raw)
  To: bpf
  Cc: netdev, magnus.karlsson, maciej.fijalkowski, sdf, kerneljasonxing,
	fw, Fernando Fernandez Mancera

Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
production"), the descriptor number is stored in the skb control block
and xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
the pool's completion queue. The skb control block shouldn't be used for
this purpose: after transmit, xsk no longer has control over it and
other subsystems could use it. This leads to the following kernel panic
due to a NULL pointer dereference.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1  PREEMPT(lazy) Debian 6.16.12-1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:xsk_destruct_skb+0xd0/0x180
[...]
Call Trace:
 <IRQ>
 ? napi_complete_done+0x7a/0x1a0
 ip_rcv_core+0x1bb/0x340
 ip_rcv+0x30/0x1f0
 __netif_receive_skb_one_core+0x85/0xa0
 process_backlog+0x87/0x130
 __napi_poll+0x28/0x180
 net_rx_action+0x339/0x420
 handle_softirqs+0xdc/0x320
 ? handle_edge_irq+0x90/0x1e0
 do_softirq.part.0+0x3b/0x60
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0x60/0x70
 __dev_direct_xmit+0x14e/0x1f0
 __xsk_generic_xmit+0x482/0xb70
 ? __remove_hrtimer+0x41/0xa0
 ? __xsk_generic_xmit+0x51/0xb70
 ? _raw_spin_unlock_irqrestore+0xe/0x40
 xsk_sendmsg+0xda/0x1c0
 __sys_sendto+0x1ee/0x200
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0x84/0x2f0
 ? __pfx_pollwake+0x10/0x10
 ? __rseq_handle_notify_resume+0xad/0x4c0
 ? restore_fpregs_from_fpstate+0x3c/0x90
 ? switch_fpu_return+0x5b/0xe0
 ? do_syscall_64+0x204/0x2f0
 ? do_syscall_64+0x204/0x2f0
 ? do_syscall_64+0x204/0x2f0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
 </TASK>
[...]
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Use the skb XDP extension to store the number of cq descriptors along
with a list of umem addresses.

Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
Note: I did some testing with the xdpsock tool, all good so far. Anyway,
XDP expert testing would be really welcomed.
---
 net/xdp/xsk.c | 81 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 60 insertions(+), 21 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 7b0c68a70888..4f3fc005d1f5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -41,15 +41,8 @@ struct xsk_addr_node {
 	struct list_head addr_node;
 };
 
-struct xsk_addr_head {
-	u32 num_descs;
-	struct list_head addrs_list;
-};
-
 static struct kmem_cache *xsk_tx_generic_cache;
 
-#define XSKCB(skb) ((struct xsk_addr_head *)((skb)->cb))
-
 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
 {
 	if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -562,6 +555,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
 				      struct sk_buff *skb)
 {
 	struct xsk_addr_node *pos, *tmp;
+	struct xdp_skb_ext *ext;
 	u32 descs_processed = 0;
 	unsigned long flags;
 	u32 idx;
@@ -573,14 +567,16 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
 			     (u64)(uintptr_t)skb_shinfo(skb)->destructor_arg);
 	descs_processed++;
 
-	if (unlikely(XSKCB(skb)->num_descs > 1)) {
-		list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
+	ext = skb_ext_find(skb, SKB_EXT_XDP);
+	if (unlikely(ext && ext->num_descs > 1)) {
+		list_for_each_entry_safe(pos, tmp, &ext->addrs_list, addr_node) {
 			xskq_prod_write_addr(pool->cq, idx + descs_processed,
 					     pos->addr);
 			descs_processed++;
 			list_del(&pos->addr_node);
 			kmem_cache_free(xsk_tx_generic_cache, pos);
 		}
+		skb_ext_del(skb, SKB_EXT_XDP);
 	}
 	xskq_prod_submit_n(pool->cq, descs_processed);
 	spin_unlock_irqrestore(&pool->cq_lock, flags);
@@ -597,12 +593,19 @@ static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
 
 static void xsk_inc_num_desc(struct sk_buff *skb)
 {
-	XSKCB(skb)->num_descs++;
+	struct xdp_skb_ext *ext;
+
+	ext = skb_ext_find(skb, SKB_EXT_XDP);
+	if (ext)
+		ext->num_descs++;
 }
 
 static u32 xsk_get_num_desc(struct sk_buff *skb)
 {
-	return XSKCB(skb)->num_descs;
+	struct xdp_skb_ext *ext;
+
+	ext = skb_ext_find(skb, SKB_EXT_XDP);
+	return (ext && ext->num_descs > 1) ? ext->num_descs : 1;
 }
 
 static void xsk_destruct_skb(struct sk_buff *skb)
@@ -621,12 +624,9 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
 			      u64 addr)
 {
-	BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
-	INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
 	skb->dev = xs->dev;
 	skb->priority = READ_ONCE(xs->sk.sk_priority);
 	skb->mark = READ_ONCE(xs->sk.sk_mark);
-	XSKCB(skb)->num_descs = 0;
 	skb->destructor = xsk_destruct_skb;
 	skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
 }
@@ -636,12 +636,15 @@ static void xsk_consume_skb(struct sk_buff *skb)
 	struct xdp_sock *xs = xdp_sk(skb->sk);
 	u32 num_descs = xsk_get_num_desc(skb);
 	struct xsk_addr_node *pos, *tmp;
+	struct xdp_skb_ext *ext;
 
-	if (unlikely(num_descs > 1)) {
-		list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
+	ext = skb_ext_find(skb, SKB_EXT_XDP);
+	if (unlikely(ext && ext->num_descs > 1)) {
+		list_for_each_entry_safe(pos, tmp, &ext->addrs_list, addr_node) {
 			list_del(&pos->addr_node);
 			kmem_cache_free(xsk_tx_generic_cache, pos);
 		}
+		skb_ext_del(skb, SKB_EXT_XDP);
 	}
 
 	skb->destructor = sock_wfree;
@@ -727,16 +730,32 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 				return ERR_PTR(err);
 			}
 		} else {
+			struct xdp_skb_ext *ext;
+
 			xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
 			if (!xsk_addr)
 				return ERR_PTR(-ENOMEM);
 
+			ext = skb_ext_find(skb, SKB_EXT_XDP);
+			if (!ext) {
+				ext = skb_ext_add(skb, SKB_EXT_XDP);
+				if (!ext)
+					return ERR_PTR(-ENOMEM);
+				memset(ext, 0, sizeof(*ext));
+				INIT_LIST_HEAD(&ext->addrs_list);
+				ext->num_descs = 1;
+			} else if (ext->num_descs == 0) {
+				INIT_LIST_HEAD(&ext->addrs_list);
+				ext->num_descs = 1;
+			}
+
 			/* in case of -EOVERFLOW that could happen below,
 			 * xsk_consume_skb() will release this node as whole skb
 			 * would be dropped, which implies freeing all list elements
 			 */
 			xsk_addr->addr = desc->addr;
-			list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
+			list_add_tail(&xsk_addr->addr_node, &ext->addrs_list);
+			xsk_inc_num_desc(skb);
 		}
 
 		len = desc->len;
@@ -804,6 +823,11 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		if (unlikely(err))
 			goto free_err;
 
+		if (!skb_ext_add(skb, SKB_EXT_XDP)) {
+			err = -ENOMEM;
+			goto free_err;
+		}
+
 		xsk_skb_init_misc(skb, xs, desc->addr);
 		if (desc->options & XDP_TX_METADATA) {
 			err = xsk_skb_metadata(skb, buffer, desc,
@@ -814,6 +838,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 	} else {
 		int nr_frags = skb_shinfo(skb)->nr_frags;
 		struct xsk_addr_node *xsk_addr;
+		struct xdp_skb_ext *ext;
 		struct page *page;
 		u8 *vaddr;
 
@@ -828,6 +853,22 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			goto free_err;
 		}
 
+		ext = skb_ext_find(skb, SKB_EXT_XDP);
+		if (!ext) {
+			ext = skb_ext_add(skb, SKB_EXT_XDP);
+			if (!ext) {
+				__free_page(page);
+				err = -ENOMEM;
+				goto free_err;
+			}
+			memset(ext, 0, sizeof(*ext));
+			INIT_LIST_HEAD(&ext->addrs_list);
+			ext->num_descs = 1;
+		} else if (ext->num_descs == 0) {
+			INIT_LIST_HEAD(&ext->addrs_list);
+			ext->num_descs = 1;
+		}
+
 		xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
 		if (!xsk_addr) {
 			__free_page(page);
@@ -843,12 +884,11 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
 
 		xsk_addr->addr = desc->addr;
-		list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
+		list_add_tail(&xsk_addr->addr_node, &ext->addrs_list);
+		xsk_inc_num_desc(skb);
 		}
 	}
 
-	xsk_inc_num_desc(skb);
-
 	return skb;
 
 free_err:
@@ -857,7 +897,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 
 	if (err == -EOVERFLOW) {
 		/* Drop the packet */
-		xsk_inc_num_desc(xs->skb);
 		xsk_drop_skb(xs->skb);
 		xskq_cons_release(xs->tx);
 	} else {
-- 
2.51.0
* Re: [PATCH 2/2 bpf] xsk: avoid data corruption on cq descriptor number
  2025-10-28 16:02 ` [PATCH 2/2 bpf] xsk: avoid data corruption on cq descriptor number Fernando Fernandez Mancera
@ 2025-10-28 16:34   ` Fernando Fernandez Mancera
  2025-10-28 16:45   ` bot+bpf-ci
  1 sibling, 0 replies; 4+ messages in thread
From: Fernando Fernandez Mancera @ 2025-10-28 16:34 UTC (permalink / raw)
  To: bpf; +Cc: netdev, magnus.karlsson, maciej.fijalkowski, sdf, kerneljasonxing, fw

On 10/28/25 5:02 PM, Fernando Fernandez Mancera wrote:
> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> production"), the descriptor number is stored in skb control block and
> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
> pool's completion queue.
> [...]
> 
>  		len = desc->len;
> @@ -804,6 +823,11 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  		if (unlikely(err))
>  			goto free_err;
>  
> +		if (!skb_ext_add(skb, SKB_EXT_XDP)) {
> +			err = -ENOMEM;
> +			goto free_err;
> +		}
> +

This is a leftover. Without it, the logic is simplified and indeed the
performance for non-fragmented traffic is not affected at all. While
reviewing this, please consider this line dropped. I will send a v2
within 24 hours anyway, as this hunk indeed introduces buggy behavior.

>  		xsk_skb_init_misc(skb, xs, desc->addr);
>  		if (desc->options & XDP_TX_METADATA) {
>  			err = xsk_skb_metadata(skb, buffer, desc,
> @@ -814,6 +838,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  	} else {
>  		int nr_frags = skb_shinfo(skb)->nr_frags;
>  		struct xsk_addr_node *xsk_addr;
> +		struct xdp_skb_ext *ext;
>  		struct page *page;
>  		u8 *vaddr;
>  
> @@ -828,6 +853,22 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  			goto free_err;
>  		}
>  
> +		ext = skb_ext_find(skb, SKB_EXT_XDP);
> +		if (!ext) {
> +			ext = skb_ext_add(skb, SKB_EXT_XDP);
> +			if (!ext) {
> +				__free_page(page);
> +				err = -ENOMEM;
> +				goto free_err;
> +			}
> +			memset(ext, 0, sizeof(*ext));
> +			INIT_LIST_HEAD(&ext->addrs_list);
> +			ext->num_descs = 1;
> +		} else if (ext->num_descs == 0) {
> +			INIT_LIST_HEAD(&ext->addrs_list);
> +			ext->num_descs = 1;
> +		}
> +
>  		xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
>  		if (!xsk_addr) {
>  			__free_page(page);
> @@ -843,12 +884,11 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  		refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
>  
>  		xsk_addr->addr = desc->addr;
> -		list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
> +		list_add_tail(&xsk_addr->addr_node, &ext->addrs_list);
> +		xsk_inc_num_desc(skb);
> 		}
> 	}
>  
> -	xsk_inc_num_desc(skb);
> -
> 	return skb;
>  
>  free_err:
> @@ -857,7 +897,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  
>  	if (err == -EOVERFLOW) {
>  		/* Drop the packet */
> -		xsk_inc_num_desc(xs->skb);
>  		xsk_drop_skb(xs->skb);
>  		xskq_cons_release(xs->tx);
>  	} else {
* Re: [PATCH 2/2 bpf] xsk: avoid data corruption on cq descriptor number
  2025-10-28 16:02 ` [PATCH 2/2 bpf] xsk: avoid data corruption on cq descriptor number Fernando Fernandez Mancera
  2025-10-28 16:34   ` Fernando Fernandez Mancera
@ 2025-10-28 16:45   ` bot+bpf-ci
  1 sibling, 0 replies; 4+ messages in thread
From: bot+bpf-ci @ 2025-10-28 16:45 UTC (permalink / raw)
  To: fmancera, bpf
  Cc: netdev, magnus.karlsson, maciej.fijalkowski, sdf, kerneljasonxing,
	fw, fmancera, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3529 bytes --]

```
commit 517548717cd8f9023f71443add6daabf21ef7135
Author: Fernando Fernandez Mancera <fmancera@suse.de>

xsk: avoid data corruption on cq descriptor number

This patch fixes data corruption in AF_XDP by migrating descriptor
information from skb->cb (which can be overwritten by other subsystems
after transmit) to the skb XDP extension.

Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 7b0c68a70..4f3fc005d 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c

[ ... ]

> @@ -727,16 +730,32 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
>  				return ERR_PTR(err);
>  			}
>  		} else {
> +			struct xdp_skb_ext *ext;
> +
>  			xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
>  			if (!xsk_addr)
>  				return ERR_PTR(-ENOMEM);
>  
> +			ext = skb_ext_find(skb, SKB_EXT_XDP);
> +			if (!ext) {
> +				ext = skb_ext_add(skb, SKB_EXT_XDP);
> +				if (!ext)
> +					return ERR_PTR(-ENOMEM);
                                        ^^^^

In xsk_build_skb_zerocopy(), when skb_ext_add() fails after the
kmem_cache_zalloc() succeeds, does this leak the xsk_addr allocation?
The xsk_addr node is allocated first, then if skb_ext_add() returns
NULL, the function returns ERR_PTR(-ENOMEM) without freeing xsk_addr.
It hasn't been added to any list yet, so nothing will clean it up.

> +				memset(ext, 0, sizeof(*ext));
> +				INIT_LIST_HEAD(&ext->addrs_list);
> +				ext->num_descs = 1;
> +			} else if (ext->num_descs == 0) {
> +				INIT_LIST_HEAD(&ext->addrs_list);
> +				ext->num_descs = 1;
> +			}
> +
>  			/* in case of -EOVERFLOW that could happen below,
>  			 * xsk_consume_skb() will release this node as whole skb
>  			 * would be dropped, which implies freeing all list elements
>  			 */
>  			xsk_addr->addr = desc->addr;
> -			list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
> +			list_add_tail(&xsk_addr->addr_node, &ext->addrs_list);
> +			xsk_inc_num_desc(skb);
> 		}

[ ... ]

> @@ -804,6 +823,11 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  		if (unlikely(err))
>  			goto free_err;
>  
> +		if (!skb_ext_add(skb, SKB_EXT_XDP)) {
> +			err = -ENOMEM;
> +			goto free_err;
> +		}
> +
        ^^^^

In xsk_build_skb() on the first-descriptor path, when skb_ext_add()
succeeds, does the extension get initialized? The code adds the
extension but doesn't capture the returned pointer or initialize it.
Later, xsk_destruct_skb()->xsk_cq_submit_addr_locked() will read
ext->num_descs, and for multi-buffer packets, subsequent descriptors
will find this extension and try to use ext->addrs_list. Without
initialization, ext->num_descs contains garbage which could be
interpreted as > 1, and ext->addrs_list is an uninitialized list_head.
This can cause the destructor to iterate a corrupt list, or
list_add_tail() in the multi-buffer path to corrupt memory.

Compare with the subsequent-descriptor path at lines 856-870 where the
extension is properly initialized with memset() and INIT_LIST_HEAD()
when added.

>  		xsk_skb_init_misc(skb, xs, desc->addr);
>  		if (desc->options & XDP_TX_METADATA) {
>  			err = xsk_skb_metadata(skb, buffer, desc,

[ ... ]
```

---
AI reviewed your patch. Please fix the bug or email reply why it's not
a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: `xsk: avoid data corruption on cq descriptor number`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18881598580