* [PATCH 1/2 bpf v2] xdp: add XDP extension to skb
From: Fernando Fernandez Mancera @ 2025-10-28 18:30 UTC (permalink / raw)
To: bpf
Cc: netdev, magnus.karlsson, maciej.fijalkowski, sdf, kerneljasonxing,
fw, Fernando Fernandez Mancera
Add a new skb extension for XDP that stores the number of cq
descriptors and a linked list of umem addresses.

The xsk skb destructor will use it to put the umem addresses onto the
pool's completion queue.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
Note: CC'ing Florian Westphal in case I have missed a relevant detail.
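For illustration, the destructor side (patch 2/2) consumes the
extension roughly like this (fragment, simplified from the
xsk_cq_submit_addr_locked() hunk in patch 2/2):

	ext = skb_ext_find(skb, SKB_EXT_XDP);
	if (ext) {
		list_for_each_entry_safe(pos, tmp, &ext->addrs_list, addr_node) {
			xskq_prod_write_addr(pool->cq, idx + descs_processed,
					     pos->addr);
			descs_processed++;
			list_del(&pos->addr_node);
			kmem_cache_free(xsk_tx_generic_cache, pos);
		}
		skb_ext_del(skb, SKB_EXT_XDP);
	}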
---
 include/linux/skbuff.h | 3 +++
 include/net/xdp_sock.h | 5 +++++
 net/core/skbuff.c      | 4 ++++
 net/xdp/Kconfig        | 1 +
4 files changed, 13 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fb3fec9affaa..1c4a598b6564 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4910,6 +4910,9 @@ enum skb_ext_id {
#endif
#if IS_ENABLED(CONFIG_INET_PSP)
SKB_EXT_PSP,
+#endif
+#if IS_ENABLED(CONFIG_XDP_SOCKETS)
+ SKB_EXT_XDP,
#endif
SKB_EXT_NUM, /* must be last */
};
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ce587a225661..94c607093768 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -120,6 +120,11 @@ struct xsk_tx_metadata_ops {
void (*tmo_request_launch_time)(u64 launch_time, void *priv);
};
+struct xdp_skb_ext {
+ u32 num_descs;
+ struct list_head addrs_list;
+};
+
#ifdef CONFIG_XDP_SOCKETS
int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6be01454f262..f3966b8c61ee 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -81,6 +81,7 @@
#include <net/page_pool/helpers.h>
#include <net/psp/types.h>
#include <net/dropreason.h>
+#include <net/xdp_sock.h>
#include <linux/uaccess.h>
#include <trace/events/skb.h>
@@ -5066,6 +5067,9 @@ static const u8 skb_ext_type_len[] = {
#if IS_ENABLED(CONFIG_INET_PSP)
[SKB_EXT_PSP] = SKB_EXT_CHUNKSIZEOF(struct psp_skb_ext),
#endif
+#if IS_ENABLED(CONFIG_XDP_SOCKETS)
+ [SKB_EXT_XDP] = SKB_EXT_CHUNKSIZEOF(struct xdp_skb_ext),
+#endif
};
static __always_inline unsigned int skb_ext_total_length(void)
diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
index 71af2febe72a..89546c48ac2a 100644
--- a/net/xdp/Kconfig
+++ b/net/xdp/Kconfig
@@ -2,6 +2,7 @@
config XDP_SOCKETS
bool "XDP sockets"
depends on BPF_SYSCALL
+ select SKB_EXTENSIONS
default n
help
XDP sockets allows a channel between XDP programs and
--
2.51.0
* [PATCH 2/2 bpf v2] xsk: avoid data corruption on cq descriptor number
From: Fernando Fernandez Mancera @ 2025-10-28 18:30 UTC (permalink / raw)
To: bpf
Cc: netdev, magnus.karlsson, maciej.fijalkowski, sdf, kerneljasonxing,
fw, Fernando Fernandez Mancera
Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
production"), the descriptor number is stored in the skb control block
and xsk_cq_submit_addr_locked() relies on it to put the umem addresses
onto the pool's completion queue.

The skb control block shouldn't be used for this purpose: after
transmit, xsk no longer has control over it and other subsystems can
use it. This leads to the following kernel panic due to a NULL pointer
dereference.
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:xsk_destruct_skb+0xd0/0x180
[...]
Call Trace:
<IRQ>
? napi_complete_done+0x7a/0x1a0
ip_rcv_core+0x1bb/0x340
ip_rcv+0x30/0x1f0
__netif_receive_skb_one_core+0x85/0xa0
process_backlog+0x87/0x130
__napi_poll+0x28/0x180
net_rx_action+0x339/0x420
handle_softirqs+0xdc/0x320
? handle_edge_irq+0x90/0x1e0
do_softirq.part.0+0x3b/0x60
</IRQ>
<TASK>
__local_bh_enable_ip+0x60/0x70
__dev_direct_xmit+0x14e/0x1f0
__xsk_generic_xmit+0x482/0xb70
? __remove_hrtimer+0x41/0xa0
? __xsk_generic_xmit+0x51/0xb70
? _raw_spin_unlock_irqrestore+0xe/0x40
xsk_sendmsg+0xda/0x1c0
__sys_sendto+0x1ee/0x200
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x84/0x2f0
? __pfx_pollwake+0x10/0x10
? __rseq_handle_notify_resume+0xad/0x4c0
? restore_fpregs_from_fpstate+0x3c/0x90
? switch_fpu_return+0x5b/0xe0
? do_syscall_64+0x204/0x2f0
? do_syscall_64+0x204/0x2f0
? do_syscall_64+0x204/0x2f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
[...]
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Use the skb XDP extension to store the number of cq descriptors along
with a list of umem addresses.
Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
v2: remove some leftovers in skb_build and simplify the fragmented
traffic logic
---
net/xdp/xsk.c | 70 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 49 insertions(+), 21 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 7b0c68a70888..7c04cadf79b9 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -41,15 +41,8 @@ struct xsk_addr_node {
struct list_head addr_node;
};
-struct xsk_addr_head {
- u32 num_descs;
- struct list_head addrs_list;
-};
-
static struct kmem_cache *xsk_tx_generic_cache;
-#define XSKCB(skb) ((struct xsk_addr_head *)((skb)->cb))
-
void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
{
if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -562,6 +555,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
struct sk_buff *skb)
{
struct xsk_addr_node *pos, *tmp;
+ struct xdp_skb_ext *ext;
u32 descs_processed = 0;
unsigned long flags;
u32 idx;
@@ -573,14 +567,16 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
(u64)(uintptr_t)skb_shinfo(skb)->destructor_arg);
descs_processed++;
- if (unlikely(XSKCB(skb)->num_descs > 1)) {
- list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
+ ext = skb_ext_find(skb, SKB_EXT_XDP);
+ if (unlikely(ext)) {
+ list_for_each_entry_safe(pos, tmp, &ext->addrs_list, addr_node) {
xskq_prod_write_addr(pool->cq, idx + descs_processed,
pos->addr);
descs_processed++;
list_del(&pos->addr_node);
kmem_cache_free(xsk_tx_generic_cache, pos);
}
+ skb_ext_del(skb, SKB_EXT_XDP);
}
xskq_prod_submit_n(pool->cq, descs_processed);
spin_unlock_irqrestore(&pool->cq_lock, flags);
@@ -597,12 +593,19 @@ static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
static void xsk_inc_num_desc(struct sk_buff *skb)
{
- XSKCB(skb)->num_descs++;
+ struct xdp_skb_ext *ext;
+
+ ext = skb_ext_find(skb, SKB_EXT_XDP);
+ if (ext)
+ ext->num_descs++;
}
static u32 xsk_get_num_desc(struct sk_buff *skb)
{
- return XSKCB(skb)->num_descs;
+ struct xdp_skb_ext *ext;
+
+ ext = skb_ext_find(skb, SKB_EXT_XDP);
+ return ext ? ext->num_descs : 1;
}
static void xsk_destruct_skb(struct sk_buff *skb)
@@ -621,12 +624,9 @@ static void xsk_destruct_skb(struct sk_buff *skb)
static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
u64 addr)
{
- BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
- INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
skb->dev = xs->dev;
skb->priority = READ_ONCE(xs->sk.sk_priority);
skb->mark = READ_ONCE(xs->sk.sk_mark);
- XSKCB(skb)->num_descs = 0;
skb->destructor = xsk_destruct_skb;
skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
}
@@ -636,12 +636,15 @@ static void xsk_consume_skb(struct sk_buff *skb)
struct xdp_sock *xs = xdp_sk(skb->sk);
u32 num_descs = xsk_get_num_desc(skb);
struct xsk_addr_node *pos, *tmp;
+ struct xdp_skb_ext *ext;
- if (unlikely(num_descs > 1)) {
- list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
+ ext = skb_ext_find(skb, SKB_EXT_XDP);
+ if (unlikely(ext)) {
+ list_for_each_entry_safe(pos, tmp, &ext->addrs_list, addr_node) {
list_del(&pos->addr_node);
kmem_cache_free(xsk_tx_generic_cache, pos);
}
+ skb_ext_del(skb, SKB_EXT_XDP);
}
skb->destructor = sock_wfree;
@@ -727,6 +730,18 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
return ERR_PTR(err);
}
} else {
+ struct xdp_skb_ext *ext;
+
+ ext = skb_ext_find(skb, SKB_EXT_XDP);
+ if (!ext) {
+ ext = skb_ext_add(skb, SKB_EXT_XDP);
+ if (!ext)
+ return ERR_PTR(-ENOMEM);
+ memset(ext, 0, sizeof(*ext));
+ INIT_LIST_HEAD(&ext->addrs_list);
+ ext->num_descs = 1;
+ }
+
xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
if (!xsk_addr)
return ERR_PTR(-ENOMEM);
@@ -736,7 +751,8 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
* would be dropped, which implies freeing all list elements
*/
xsk_addr->addr = desc->addr;
- list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
+ list_add_tail(&xsk_addr->addr_node, &ext->addrs_list);
+ xsk_inc_num_desc(skb);
}
len = desc->len;
@@ -814,6 +830,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
} else {
int nr_frags = skb_shinfo(skb)->nr_frags;
struct xsk_addr_node *xsk_addr;
+ struct xdp_skb_ext *ext;
struct page *page;
u8 *vaddr;
@@ -828,6 +845,19 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
goto free_err;
}
+ ext = skb_ext_find(skb, SKB_EXT_XDP);
+ if (!ext) {
+ ext = skb_ext_add(skb, SKB_EXT_XDP);
+ if (!ext) {
+ __free_page(page);
+ err = -ENOMEM;
+ goto free_err;
+ }
+ memset(ext, 0, sizeof(*ext));
+ INIT_LIST_HEAD(&ext->addrs_list);
+ ext->num_descs = 1;
+ }
+
xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
if (!xsk_addr) {
__free_page(page);
@@ -843,12 +873,11 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
xsk_addr->addr = desc->addr;
- list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
+ list_add_tail(&xsk_addr->addr_node, &ext->addrs_list);
+ xsk_inc_num_desc(skb);
}
}
- xsk_inc_num_desc(skb);
-
return skb;
free_err:
@@ -857,7 +886,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
if (err == -EOVERFLOW) {
/* Drop the packet */
- xsk_inc_num_desc(xs->skb);
xsk_drop_skb(xs->skb);
xskq_cons_release(xs->tx);
} else {
--
2.51.0
* Re: [PATCH 1/2 bpf v2] xdp: add XDP extension to skb
From: Jakub Kicinski @ 2025-10-28 22:55 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: bpf, netdev, magnus.karlsson, maciej.fijalkowski, sdf,
kerneljasonxing, fw
On Tue, 28 Oct 2025 19:30:31 +0100 Fernando Fernandez Mancera wrote:
> Add a new skb extension for XDP that stores the number of cq
> descriptors and a linked list of umem addresses.
>
> The xsk skb destructor will use it to put the umem addresses onto
> the pool's completion queue.
This increases the size of skb extensions and burns a bit in every skb.
Unacceptable for a fallback / slowpath.
* Re: [PATCH 2/2 bpf v2] xsk: avoid data corruption on cq descriptor number
From: Jakub Kicinski @ 2025-10-28 23:01 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: bpf, netdev, magnus.karlsson, maciej.fijalkowski, sdf,
kerneljasonxing, fw
On Tue, 28 Oct 2025 19:30:32 +0100 Fernando Fernandez Mancera wrote:
> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> production"), the descriptor number is stored in the skb control
> block and xsk_cq_submit_addr_locked() relies on it to put the umem
> addresses onto the pool's completion queue.
Looking at the past discussion, it sounds like you want to optimize
the single-descriptor case? Can you not use a magic pointer for that?

#define XSK_DESTRUCT_SINGLE_BUF (void *)1
destructor_arg = XSK_DESTRUCT_SINGLE_BUF

Let's target this fix at net, please; I think the complexity here is
all in the skb paths.
* Re: [PATCH 2/2 bpf v2] xsk: avoid data corruption on cq descriptor number
From: Fernando Fernandez Mancera @ 2025-10-29 7:51 UTC (permalink / raw)
To: Jakub Kicinski
Cc: bpf, netdev, magnus.karlsson, maciej.fijalkowski, sdf,
kerneljasonxing, fw
On 10/29/25 12:01 AM, Jakub Kicinski wrote:
> On Tue, 28 Oct 2025 19:30:32 +0100 Fernando Fernandez Mancera wrote:
>> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
>> production"), the descriptor number is stored in the skb control
>> block and xsk_cq_submit_addr_locked() relies on it to put the umem
>> addresses onto the pool's completion queue.
>
> Looking at the past discussion, it sounds like you want to optimize
> the single-descriptor case? Can you not use a magic pointer for that?
>
> #define XSK_DESTRUCT_SINGLE_BUF (void *)1
> destructor_arg = XSK_DESTRUCT_SINGLE_BUF
>
> Let's target this fix at net, please; I think the complexity here is
> all in the skb paths.
I might be missing something here, but if the destructor_arg pointer
is used for this, where should we store the umem address associated
with it? In the proposed approach the skb extension is not added for
non-fragmented traffic: there is only a single descriptor, so we can
store the umem address in destructor_arg directly.

The skb extension is only allocated for fragmented traffic (multiple
descriptors). Sure, if there is a fallback to the slowpath it will
cost a bit of performance, but for that to happen the user must have
tried to use the AF_XDP family initially. AFAICS, the skb extension
is only allocated when skb_ext_add() is called.
Thanks,
Fernando.
* Re: [PATCH 2/2 bpf v2] xsk: avoid data corruption on cq descriptor number
From: Jakub Kicinski @ 2025-10-29 23:22 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: bpf, netdev, magnus.karlsson, maciej.fijalkowski, sdf,
kerneljasonxing, fw
On Wed, 29 Oct 2025 08:51:58 +0100 Fernando Fernandez Mancera wrote:
> On 10/29/25 12:01 AM, Jakub Kicinski wrote:
> > On Tue, 28 Oct 2025 19:30:32 +0100 Fernando Fernandez Mancera wrote:
> >> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> >> production"), the descriptor number is stored in the skb control
> >> block and xsk_cq_submit_addr_locked() relies on it to put the umem
> >> addresses onto the pool's completion queue.
> >
> > Looking at the past discussion, it sounds like you want to optimize
> > the single-descriptor case? Can you not use a magic pointer for that?
> >
> > #define XSK_DESTRUCT_SINGLE_BUF (void *)1
> > destructor_arg = XSK_DESTRUCT_SINGLE_BUF
> >
> > Let's target this fix at net, please; I think the complexity here is
> > all in the skb paths.
>
> I might be missing something here, but if the destructor_arg pointer
> is used for this, where should we store the umem address associated
> with it? In the proposed approach the skb extension is not added for
> non-fragmented traffic: there is only a single descriptor, so we can
> store the umem address in destructor_arg directly.
I see. Pointers are always aligned to 8B, so you can stash a "pointer
type" in the low bits. If the bottom bit is 1, it's a umem address and
the skb was single-chunk. If the bottom bit is 0 (and the value is
non-NULL), it's a full kmalloc'ed struct.
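Something along these lines (completely untested, names made up;
assumes bit 63 of a umem address is never set, so the address survives
the shift by one):

	#define XSK_ADDR_SINGLE	0x1UL

	/* single-chunk skb: stash the tagged umem address directly */
	skb_shinfo(skb)->destructor_arg =
		(void *)(uintptr_t)((addr << 1) | XSK_ADDR_SINGLE);

	/* destructor side */
	arg = (uintptr_t)skb_shinfo(skb)->destructor_arg;
	if (arg & XSK_ADDR_SINGLE)
		addr = (u64)(arg >> 1);	/* umem address, no allocation */
	else
		head = (struct xsk_addr_head *)arg; /* kmalloc'ed struct */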
> The skb extension is only allocated for fragmented traffic (multiple
> descriptors). Sure, if there is a fallback to the slowpath it will
> cost a bit of performance, but for that to happen the user must have
> tried to use the AF_XDP family initially. AFAICS, the skb extension
> is only allocated when skb_ext_add() is called.
To be clear, by adding an skb extension you are de facto allocating
a bit in the skb struct: one of the bits of the active_extensions
field instead of a separate bitfield. If you can depend on the socket
association instead, this is quite wasteful.
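FWIW the existence check is literally a bit test on that field,
something like this (paraphrasing the helper in include/linux/skbuff.h
from memory):

	static inline bool skb_ext_exist(const struct sk_buff *skb,
					 enum skb_ext_id id)
	{
		return skb->active_extensions & (1 << id);
	}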
* Re: [PATCH 2/2 bpf v2] xsk: avoid data corruption on cq descriptor number
From: Jason Xing @ 2025-10-30 1:05 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: bpf, netdev, magnus.karlsson, maciej.fijalkowski, sdf, fw
On Wed, Oct 29, 2025 at 2:30 AM Fernando Fernandez Mancera
<fmancera@suse.de> wrote:
>
> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> production"), the descriptor number is stored in the skb control
> block and xsk_cq_submit_addr_locked() relies on it to put the umem
> addresses onto the pool's completion queue.
>
> The skb control block shouldn't be used for this purpose: after
> transmit, xsk no longer has control over it and other subsystems can
> use it. This leads to the following kernel panic due to a NULL
> pointer dereference.
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> RIP: 0010:xsk_destruct_skb+0xd0/0x180
> [...]
> Call Trace:
> <IRQ>
> ? napi_complete_done+0x7a/0x1a0
> ip_rcv_core+0x1bb/0x340
> ip_rcv+0x30/0x1f0
> __netif_receive_skb_one_core+0x85/0xa0
> process_backlog+0x87/0x130
> __napi_poll+0x28/0x180
> net_rx_action+0x339/0x420
> handle_softirqs+0xdc/0x320
> ? handle_edge_irq+0x90/0x1e0
> do_softirq.part.0+0x3b/0x60
> </IRQ>
> <TASK>
> __local_bh_enable_ip+0x60/0x70
> __dev_direct_xmit+0x14e/0x1f0
> __xsk_generic_xmit+0x482/0xb70
> ? __remove_hrtimer+0x41/0xa0
> ? __xsk_generic_xmit+0x51/0xb70
> ? _raw_spin_unlock_irqrestore+0xe/0x40
> xsk_sendmsg+0xda/0x1c0
> __sys_sendto+0x1ee/0x200
> __x64_sys_sendto+0x24/0x30
> do_syscall_64+0x84/0x2f0
> ? __pfx_pollwake+0x10/0x10
> ? __rseq_handle_notify_resume+0xad/0x4c0
> ? restore_fpregs_from_fpstate+0x3c/0x90
> ? switch_fpu_return+0x5b/0xe0
> ? do_syscall_64+0x204/0x2f0
> ? do_syscall_64+0x204/0x2f0
> ? do_syscall_64+0x204/0x2f0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> </TASK>
> [...]
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> Use the skb XDP extension to store the number of cq descriptors along
> with a list of umem addresses.
>
> Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
It might be a bit off topic, but I keep wondering if we can find
another solution to the issue the above patches try to fix: how to
avoid the wrong fallback in that corner case. It's unlikely to happen,
but the cost of correcting it is high, because it adds memory
allocation in the hot path and frees memory in irq context with irqs
disabled... Sure, for the latter we could implement a workqueue to
mitigate the impact, but I have no intention of expanding into a
different topic here.

Adding additional memory allocation or something like that is really
not a promising direction to take. In high performance features like
copy mode in xsk, the biggest challenge is memory operations, and the
second one is the various locks.
Thanks,
Jason
* Re: [PATCH 2/2 bpf v2] xsk: avoid data corruption on cq descriptor number
From: Fernando Fernandez Mancera @ 2025-10-30 8:38 UTC (permalink / raw)
To: Jakub Kicinski
Cc: bpf, netdev, magnus.karlsson, maciej.fijalkowski, sdf,
kerneljasonxing, fw
On 10/30/25 12:22 AM, Jakub Kicinski wrote:
> On Wed, 29 Oct 2025 08:51:58 +0100 Fernando Fernandez Mancera wrote:
>> On 10/29/25 12:01 AM, Jakub Kicinski wrote:
>>> On Tue, 28 Oct 2025 19:30:32 +0100 Fernando Fernandez Mancera wrote:
>>>> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
>>>> production"), the descriptor number is stored in the skb control
>>>> block and xsk_cq_submit_addr_locked() relies on it to put the umem
>>>> addresses onto the pool's completion queue.
>>>
>>> Looking at the past discussion, it sounds like you want to optimize
>>> the single-descriptor case? Can you not use a magic pointer for that?
>>>
>>> #define XSK_DESTRUCT_SINGLE_BUF (void *)1
>>> destructor_arg = XSK_DESTRUCT_SINGLE_BUF
>>>
>>> Let's target this fix at net, please; I think the complexity here is
>>> all in the skb paths.
>>
>> I might be missing something here, but if the destructor_arg pointer
>> is used for this, where should we store the umem address associated
>> with it? In the proposed approach the skb extension is not added for
>> non-fragmented traffic: there is only a single descriptor, so we can
>> store the umem address in destructor_arg directly.
>
> I see. Pointers are always aligned to 8B, so you can stash a "pointer
> type" in the low bits. If the bottom bit is 1, it's a umem address and
> the skb was single-chunk. If the bottom bit is 0 (and the value is
> non-NULL), it's a full kmalloc'ed struct.
>
That is a good point. Pointer tagging might be a good solution here.
Thanks, let me try that.
>> The skb extension is only allocated for fragmented traffic (multiple
>> descriptors). Sure, if there is a fallback to the slowpath it will
>> cost a bit of performance, but for that to happen the user must have
>> tried to use the AF_XDP family initially. AFAICS, the skb extension
>> is only allocated when skb_ext_add() is called.
>
> To be clear, by adding an skb extension you are de facto allocating
> a bit in the skb struct: one of the bits of the active_extensions
> field instead of a separate bitfield. If you can depend on the socket
> association instead, this is quite wasteful.
>