* [PATCH net v6] xsk: avoid data corruption on cq descriptor number
@ 2025-11-24 17:14 Fernando Fernandez Mancera
2025-11-24 23:03 ` Maciej Fijalkowski
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Fernando Fernandez Mancera @ 2025-11-24 17:14 UTC (permalink / raw)
To: netdev
Cc: csmate, kerneljasonxing, maciej.fijalkowski, bpf, davem, edumazet,
kuba, pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson, Fernando Fernandez Mancera
Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
production"), the descriptor number is stored in the skb control block
and xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
the pool's completion queue.
The skb control block shouldn't be used for this purpose because xsk
has no control over it after transmit and other subsystems could use
it. This leads to the following kernel panic due to a NULL pointer
dereference.
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:xsk_destruct_skb+0xd0/0x180
[...]
Call Trace:
<IRQ>
? napi_complete_done+0x7a/0x1a0
ip_rcv_core+0x1bb/0x340
ip_rcv+0x30/0x1f0
__netif_receive_skb_one_core+0x85/0xa0
process_backlog+0x87/0x130
__napi_poll+0x28/0x180
net_rx_action+0x339/0x420
handle_softirqs+0xdc/0x320
? handle_edge_irq+0x90/0x1e0
do_softirq.part.0+0x3b/0x60
</IRQ>
<TASK>
__local_bh_enable_ip+0x60/0x70
__dev_direct_xmit+0x14e/0x1f0
__xsk_generic_xmit+0x482/0xb70
? __remove_hrtimer+0x41/0xa0
? __xsk_generic_xmit+0x51/0xb70
? _raw_spin_unlock_irqrestore+0xe/0x40
xsk_sendmsg+0xda/0x1c0
__sys_sendto+0x1ee/0x200
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x84/0x2f0
? __pfx_pollwake+0x10/0x10
? __rseq_handle_notify_resume+0xad/0x4c0
? restore_fpregs_from_fpstate+0x3c/0x90
? switch_fpu_return+0x5b/0xe0
? do_syscall_64+0x204/0x2f0
? do_syscall_64+0x204/0x2f0
? do_syscall_64+0x204/0x2f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
[...]
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Instead, use the skb destructor_arg pointer along with pointer tagging.
As pointers are always aligned to 8B, use the bottom bit to indicate
whether this is a single address or an allocated struct containing
several addresses.
Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
v2: remove some leftovers on skb_build and simplify fragmented traffic
logic
v3: drop skb extension approach, instead use pointer tagging in
destructor_arg to know whether we have a single address or an allocated
struct with multiple ones. Also, move from bpf to net as requested
v4: repost after rebasing
v5: fixed increase logic so -EOVERFLOW is handled correctly as
suggested by Jason. Also dropped the acks/reviewed tags as code changed.
v6: added the xsk_skb_destructor_set_addr() helper and removed the
unnecessary if statement checking whether destructor_arg is an address
before incrementing, since the check is already done in the helper.
---
net/xdp/xsk.c | 143 +++++++++++++++++++++++++++++++-------------------
1 file changed, 88 insertions(+), 55 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 7b0c68a70888..69bbcca8ac75 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -36,20 +36,13 @@
#define TX_BATCH_SIZE 32
#define MAX_PER_SOCKET_BUDGET 32
-struct xsk_addr_node {
- u64 addr;
- struct list_head addr_node;
-};
-
-struct xsk_addr_head {
+struct xsk_addrs {
u32 num_descs;
- struct list_head addrs_list;
+ u64 addrs[MAX_SKB_FRAGS + 1];
};
static struct kmem_cache *xsk_tx_generic_cache;
-#define XSKCB(skb) ((struct xsk_addr_head *)((skb)->cb))
-
void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
{
if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -558,29 +551,68 @@ static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool)
return ret;
}
+static bool xsk_skb_destructor_is_addr(struct sk_buff *skb)
+{
+ return (uintptr_t)skb_shinfo(skb)->destructor_arg & 0x1UL;
+}
+
+static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
+{
+ return (u64)((uintptr_t)skb_shinfo(skb)->destructor_arg & ~0x1UL);
+}
+
+static void xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
+{
+ skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
+}
+
+static void xsk_inc_num_desc(struct sk_buff *skb)
+{
+ struct xsk_addrs *xsk_addr;
+
+ if (!xsk_skb_destructor_is_addr(skb)) {
+ xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+ xsk_addr->num_descs++;
+ }
+}
+
+static u32 xsk_get_num_desc(struct sk_buff *skb)
+{
+ struct xsk_addrs *xsk_addr;
+
+ if (xsk_skb_destructor_is_addr(skb))
+ return 1;
+
+ xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+
+ return xsk_addr->num_descs;
+}
+
static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
struct sk_buff *skb)
{
- struct xsk_addr_node *pos, *tmp;
+ u32 num_descs = xsk_get_num_desc(skb);
+ struct xsk_addrs *xsk_addr;
u32 descs_processed = 0;
unsigned long flags;
- u32 idx;
+ u32 idx, i;
spin_lock_irqsave(&pool->cq_lock, flags);
idx = xskq_get_prod(pool->cq);
- xskq_prod_write_addr(pool->cq, idx,
- (u64)(uintptr_t)skb_shinfo(skb)->destructor_arg);
- descs_processed++;
+ if (unlikely(num_descs > 1)) {
+ xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
- if (unlikely(XSKCB(skb)->num_descs > 1)) {
- list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
+ for (i = 0; i < num_descs; i++) {
xskq_prod_write_addr(pool->cq, idx + descs_processed,
- pos->addr);
+ xsk_addr->addrs[i]);
descs_processed++;
- list_del(&pos->addr_node);
- kmem_cache_free(xsk_tx_generic_cache, pos);
}
+ kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
+ } else {
+ xskq_prod_write_addr(pool->cq, idx,
+ xsk_skb_destructor_get_addr(skb));
+ descs_processed++;
}
xskq_prod_submit_n(pool->cq, descs_processed);
spin_unlock_irqrestore(&pool->cq_lock, flags);
@@ -595,16 +627,6 @@ static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
spin_unlock_irqrestore(&pool->cq_lock, flags);
}
-static void xsk_inc_num_desc(struct sk_buff *skb)
-{
- XSKCB(skb)->num_descs++;
-}
-
-static u32 xsk_get_num_desc(struct sk_buff *skb)
-{
- return XSKCB(skb)->num_descs;
-}
-
static void xsk_destruct_skb(struct sk_buff *skb)
{
struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
@@ -621,27 +643,22 @@ static void xsk_destruct_skb(struct sk_buff *skb)
static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
u64 addr)
{
- BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
- INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
skb->dev = xs->dev;
skb->priority = READ_ONCE(xs->sk.sk_priority);
skb->mark = READ_ONCE(xs->sk.sk_mark);
- XSKCB(skb)->num_descs = 0;
skb->destructor = xsk_destruct_skb;
- skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
+ xsk_skb_destructor_set_addr(skb, addr);
}
static void xsk_consume_skb(struct sk_buff *skb)
{
struct xdp_sock *xs = xdp_sk(skb->sk);
u32 num_descs = xsk_get_num_desc(skb);
- struct xsk_addr_node *pos, *tmp;
+ struct xsk_addrs *xsk_addr;
if (unlikely(num_descs > 1)) {
- list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
- list_del(&pos->addr_node);
- kmem_cache_free(xsk_tx_generic_cache, pos);
- }
+ xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+ kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
}
skb->destructor = sock_wfree;
@@ -701,7 +718,6 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
{
struct xsk_buff_pool *pool = xs->pool;
u32 hr, len, ts, offset, copy, copied;
- struct xsk_addr_node *xsk_addr;
struct sk_buff *skb = xs->skb;
struct page *page;
void *buffer;
@@ -727,16 +743,26 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
return ERR_PTR(err);
}
} else {
- xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
- if (!xsk_addr)
- return ERR_PTR(-ENOMEM);
+ struct xsk_addrs *xsk_addr;
+
+ if (xsk_skb_destructor_is_addr(skb)) {
+ xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
+ GFP_KERNEL);
+ if (!xsk_addr)
+ return ERR_PTR(-ENOMEM);
+
+ xsk_addr->num_descs = 1;
+ xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
+ skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
+ } else {
+ xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+ }
/* in case of -EOVERFLOW that could happen below,
* xsk_consume_skb() will release this node as whole skb
* would be dropped, which implies freeing all list elements
*/
- xsk_addr->addr = desc->addr;
- list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
+ xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
}
len = desc->len;
@@ -813,10 +839,25 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
}
} else {
int nr_frags = skb_shinfo(skb)->nr_frags;
- struct xsk_addr_node *xsk_addr;
+ struct xsk_addrs *xsk_addr;
struct page *page;
u8 *vaddr;
+ if (xsk_skb_destructor_is_addr(skb)) {
+ xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
+ GFP_KERNEL);
+ if (!xsk_addr) {
+ err = -ENOMEM;
+ goto free_err;
+ }
+
+ xsk_addr->num_descs = 1;
+ xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
+ skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
+ } else {
+ xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
+ }
+
if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
err = -EOVERFLOW;
goto free_err;
@@ -828,13 +869,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
goto free_err;
}
- xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
- if (!xsk_addr) {
- __free_page(page);
- err = -ENOMEM;
- goto free_err;
- }
-
vaddr = kmap_local_page(page);
memcpy(vaddr, buffer, len);
kunmap_local(vaddr);
@@ -842,8 +876,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
- xsk_addr->addr = desc->addr;
- list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
+ xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
}
}
@@ -1904,7 +1937,7 @@ static int __init xsk_init(void)
goto out_pernet;
xsk_tx_generic_cache = kmem_cache_create("xsk_generic_xmit_cache",
- sizeof(struct xsk_addr_node),
+ sizeof(struct xsk_addrs),
0, SLAB_HWCACHE_ALIGN, NULL);
if (!xsk_tx_generic_cache) {
err = -ENOMEM;
--
2.51.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number
2025-11-24 17:14 [PATCH net v6] xsk: avoid data corruption on cq descriptor number Fernando Fernandez Mancera
@ 2025-11-24 23:03 ` Maciej Fijalkowski
2025-11-24 23:41 ` Jason Xing
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Maciej Fijalkowski @ 2025-11-24 23:03 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: netdev, csmate, kerneljasonxing, bpf, davem, edumazet, kuba,
pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson
On Mon, Nov 24, 2025 at 06:14:09PM +0100, Fernando Fernandez Mancera wrote:
> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> production"), the descriptor number is stored in skb control block and
> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
> pool's completion queue.
>
> skb control block shouldn't be used for this purpose as after transmit
> xsk doesn't have control over it and other subsystems could use it. This
> leads to the following kernel panic due to a NULL pointer dereference.
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> RIP: 0010:xsk_destruct_skb+0xd0/0x180
> [...]
> Call Trace:
> <IRQ>
> ? napi_complete_done+0x7a/0x1a0
> ip_rcv_core+0x1bb/0x340
> ip_rcv+0x30/0x1f0
> __netif_receive_skb_one_core+0x85/0xa0
> process_backlog+0x87/0x130
> __napi_poll+0x28/0x180
> net_rx_action+0x339/0x420
> handle_softirqs+0xdc/0x320
> ? handle_edge_irq+0x90/0x1e0
> do_softirq.part.0+0x3b/0x60
> </IRQ>
> <TASK>
> __local_bh_enable_ip+0x60/0x70
> __dev_direct_xmit+0x14e/0x1f0
> __xsk_generic_xmit+0x482/0xb70
> ? __remove_hrtimer+0x41/0xa0
> ? __xsk_generic_xmit+0x51/0xb70
> ? _raw_spin_unlock_irqrestore+0xe/0x40
> xsk_sendmsg+0xda/0x1c0
> __sys_sendto+0x1ee/0x200
> __x64_sys_sendto+0x24/0x30
> do_syscall_64+0x84/0x2f0
> ? __pfx_pollwake+0x10/0x10
> ? __rseq_handle_notify_resume+0xad/0x4c0
> ? restore_fpregs_from_fpstate+0x3c/0x90
> ? switch_fpu_return+0x5b/0xe0
> ? do_syscall_64+0x204/0x2f0
> ? do_syscall_64+0x204/0x2f0
> ? do_syscall_64+0x204/0x2f0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> </TASK>
> [...]
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> Instead use the skb destructor_arg pointer along with pointer tagging.
> As pointers are always aligned to 8B, use the bottom bit to indicate
> whether this is a single address or an allocated struct containing several
> addresses.
>
> Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
> Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
> v2: remove some leftovers on skb_build and simplify fragmented traffic
> logic
>
> v3: drop skb extension approach, instead use pointer tagging in
> destructor_arg to know whether we have a single address or an allocated
> struct with multiple ones. Also, move from bpf to net as requested
>
> v4: repost after rebasing
>
> v5: fixed increase logic so -EOVERFLOW is handled correctly as
> suggested by Jason. Also dropped the acks/reviewed tags as code changed.
>
> v6: added the xsk_skb_destructor_set_addr() helper and removed the
> unnecessary if statement checking whether destructor_arg is an address
> before incrementing, since the check is already done in the helper.
> ---
> net/xdp/xsk.c | 143 +++++++++++++++++++++++++++++++-------------------
> 1 file changed, 88 insertions(+), 55 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 7b0c68a70888..69bbcca8ac75 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -36,20 +36,13 @@
> #define TX_BATCH_SIZE 32
> #define MAX_PER_SOCKET_BUDGET 32
>
> -struct xsk_addr_node {
> - u64 addr;
> - struct list_head addr_node;
> -};
> -
> -struct xsk_addr_head {
> +struct xsk_addrs {
> u32 num_descs;
> - struct list_head addrs_list;
> + u64 addrs[MAX_SKB_FRAGS + 1];
> };
>
> static struct kmem_cache *xsk_tx_generic_cache;
>
> -#define XSKCB(skb) ((struct xsk_addr_head *)((skb)->cb))
> -
> void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
> {
> if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
> @@ -558,29 +551,68 @@ static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool)
> return ret;
> }
>
> +static bool xsk_skb_destructor_is_addr(struct sk_buff *skb)
> +{
> + return (uintptr_t)skb_shinfo(skb)->destructor_arg & 0x1UL;
> +}
> +
> +static u64 xsk_skb_destructor_get_addr(struct sk_buff *skb)
> +{
> + return (u64)((uintptr_t)skb_shinfo(skb)->destructor_arg & ~0x1UL);
> +}
> +
> +static void xsk_skb_destructor_set_addr(struct sk_buff *skb, u64 addr)
> +{
> + skb_shinfo(skb)->destructor_arg = (void *)((uintptr_t)addr | 0x1UL);
> +}
> +
> +static void xsk_inc_num_desc(struct sk_buff *skb)
> +{
> + struct xsk_addrs *xsk_addr;
> +
> + if (!xsk_skb_destructor_is_addr(skb)) {
> + xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> + xsk_addr->num_descs++;
> + }
> +}
> +
> +static u32 xsk_get_num_desc(struct sk_buff *skb)
> +{
> + struct xsk_addrs *xsk_addr;
> +
> + if (xsk_skb_destructor_is_addr(skb))
> + return 1;
> +
> + xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> +
> + return xsk_addr->num_descs;
> +}
> +
> static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
> struct sk_buff *skb)
> {
> - struct xsk_addr_node *pos, *tmp;
> + u32 num_descs = xsk_get_num_desc(skb);
> + struct xsk_addrs *xsk_addr;
> u32 descs_processed = 0;
> unsigned long flags;
> - u32 idx;
> + u32 idx, i;
>
> spin_lock_irqsave(&pool->cq_lock, flags);
> idx = xskq_get_prod(pool->cq);
>
> - xskq_prod_write_addr(pool->cq, idx,
> - (u64)(uintptr_t)skb_shinfo(skb)->destructor_arg);
> - descs_processed++;
> + if (unlikely(num_descs > 1)) {
> + xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
>
> - if (unlikely(XSKCB(skb)->num_descs > 1)) {
> - list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
> + for (i = 0; i < num_descs; i++) {
> xskq_prod_write_addr(pool->cq, idx + descs_processed,
> - pos->addr);
> + xsk_addr->addrs[i]);
> descs_processed++;
> - list_del(&pos->addr_node);
> - kmem_cache_free(xsk_tx_generic_cache, pos);
> }
> + kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
> + } else {
> + xskq_prod_write_addr(pool->cq, idx,
> + xsk_skb_destructor_get_addr(skb));
> + descs_processed++;
> }
> xskq_prod_submit_n(pool->cq, descs_processed);
> spin_unlock_irqrestore(&pool->cq_lock, flags);
> @@ -595,16 +627,6 @@ static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
> spin_unlock_irqrestore(&pool->cq_lock, flags);
> }
>
> -static void xsk_inc_num_desc(struct sk_buff *skb)
> -{
> - XSKCB(skb)->num_descs++;
> -}
> -
> -static u32 xsk_get_num_desc(struct sk_buff *skb)
> -{
> - return XSKCB(skb)->num_descs;
> -}
> -
> static void xsk_destruct_skb(struct sk_buff *skb)
> {
> struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
> @@ -621,27 +643,22 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs,
> u64 addr)
> {
> - BUILD_BUG_ON(sizeof(struct xsk_addr_head) > sizeof(skb->cb));
> - INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> skb->dev = xs->dev;
> skb->priority = READ_ONCE(xs->sk.sk_priority);
> skb->mark = READ_ONCE(xs->sk.sk_mark);
> - XSKCB(skb)->num_descs = 0;
> skb->destructor = xsk_destruct_skb;
> - skb_shinfo(skb)->destructor_arg = (void *)(uintptr_t)addr;
> + xsk_skb_destructor_set_addr(skb, addr);
> }
>
> static void xsk_consume_skb(struct sk_buff *skb)
> {
> struct xdp_sock *xs = xdp_sk(skb->sk);
> u32 num_descs = xsk_get_num_desc(skb);
> - struct xsk_addr_node *pos, *tmp;
> + struct xsk_addrs *xsk_addr;
>
> if (unlikely(num_descs > 1)) {
> - list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
> - list_del(&pos->addr_node);
> - kmem_cache_free(xsk_tx_generic_cache, pos);
> - }
> + xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> + kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
> }
>
> skb->destructor = sock_wfree;
> @@ -701,7 +718,6 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> {
> struct xsk_buff_pool *pool = xs->pool;
> u32 hr, len, ts, offset, copy, copied;
> - struct xsk_addr_node *xsk_addr;
> struct sk_buff *skb = xs->skb;
> struct page *page;
> void *buffer;
> @@ -727,16 +743,26 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> return ERR_PTR(err);
> }
> } else {
> - xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> - if (!xsk_addr)
> - return ERR_PTR(-ENOMEM);
> + struct xsk_addrs *xsk_addr;
> +
> + if (xsk_skb_destructor_is_addr(skb)) {
> + xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
> + GFP_KERNEL);
> + if (!xsk_addr)
> + return ERR_PTR(-ENOMEM);
> +
> + xsk_addr->num_descs = 1;
> + xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
> + skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
> + } else {
> + xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> + }
>
> /* in case of -EOVERFLOW that could happen below,
> * xsk_consume_skb() will release this node as whole skb
> * would be dropped, which implies freeing all list elements
> */
> - xsk_addr->addr = desc->addr;
> - list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
> + xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> }
>
> len = desc->len;
> @@ -813,10 +839,25 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> }
> } else {
> int nr_frags = skb_shinfo(skb)->nr_frags;
> - struct xsk_addr_node *xsk_addr;
> + struct xsk_addrs *xsk_addr;
> struct page *page;
> u8 *vaddr;
>
> + if (xsk_skb_destructor_is_addr(skb)) {
> + xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache,
> + GFP_KERNEL);
> + if (!xsk_addr) {
> + err = -ENOMEM;
> + goto free_err;
> + }
> +
> + xsk_addr->num_descs = 1;
> + xsk_addr->addrs[0] = xsk_skb_destructor_get_addr(skb);
> + skb_shinfo(skb)->destructor_arg = (void *)xsk_addr;
> + } else {
> + xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> + }
> +
> if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
> err = -EOVERFLOW;
> goto free_err;
> @@ -828,13 +869,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> goto free_err;
> }
>
> - xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> - if (!xsk_addr) {
> - __free_page(page);
> - err = -ENOMEM;
> - goto free_err;
> - }
> -
> vaddr = kmap_local_page(page);
> memcpy(vaddr, buffer, len);
> kunmap_local(vaddr);
> @@ -842,8 +876,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
> refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
>
> - xsk_addr->addr = desc->addr;
> - list_add_tail(&xsk_addr->addr_node, &XSKCB(skb)->addrs_list);
> + xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> }
> }
>
> @@ -1904,7 +1937,7 @@ static int __init xsk_init(void)
> goto out_pernet;
>
> xsk_tx_generic_cache = kmem_cache_create("xsk_generic_xmit_cache",
> - sizeof(struct xsk_addr_node),
> + sizeof(struct xsk_addrs),
> 0, SLAB_HWCACHE_ALIGN, NULL);
> if (!xsk_tx_generic_cache) {
> err = -ENOMEM;
> --
> 2.51.1
>
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number
2025-11-24 17:14 [PATCH net v6] xsk: avoid data corruption on cq descriptor number Fernando Fernandez Mancera
2025-11-24 23:03 ` Maciej Fijalkowski
@ 2025-11-24 23:41 ` Jason Xing
2025-11-25 11:40 ` Fernando Fernandez Mancera
2025-11-26 4:00 ` patchwork-bot+netdevbpf
2025-11-26 11:42 ` [PATCH net v6] xsk: avoid data corruption on cq descriptor number: manual merge Matthieu Baerts
3 siblings, 1 reply; 12+ messages in thread
From: Jason Xing @ 2025-11-24 23:41 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: netdev, csmate, maciej.fijalkowski, bpf, davem, edumazet, kuba,
pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson
On Tue, Nov 25, 2025 at 1:14 AM Fernando Fernandez Mancera
<fmancera@suse.de> wrote:
>
> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> production"), the descriptor number is stored in skb control block and
> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
> pool's completion queue.
>
> skb control block shouldn't be used for this purpose as after transmit
> xsk doesn't have control over it and other subsystems could use it. This
> leads to the following kernel panic due to a NULL pointer dereference.
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> RIP: 0010:xsk_destruct_skb+0xd0/0x180
> [...]
> Call Trace:
> <IRQ>
> ? napi_complete_done+0x7a/0x1a0
> ip_rcv_core+0x1bb/0x340
> ip_rcv+0x30/0x1f0
> __netif_receive_skb_one_core+0x85/0xa0
> process_backlog+0x87/0x130
> __napi_poll+0x28/0x180
> net_rx_action+0x339/0x420
> handle_softirqs+0xdc/0x320
> ? handle_edge_irq+0x90/0x1e0
> do_softirq.part.0+0x3b/0x60
> </IRQ>
> <TASK>
> __local_bh_enable_ip+0x60/0x70
> __dev_direct_xmit+0x14e/0x1f0
> __xsk_generic_xmit+0x482/0xb70
> ? __remove_hrtimer+0x41/0xa0
> ? __xsk_generic_xmit+0x51/0xb70
> ? _raw_spin_unlock_irqrestore+0xe/0x40
> xsk_sendmsg+0xda/0x1c0
> __sys_sendto+0x1ee/0x200
> __x64_sys_sendto+0x24/0x30
> do_syscall_64+0x84/0x2f0
> ? __pfx_pollwake+0x10/0x10
> ? __rseq_handle_notify_resume+0xad/0x4c0
> ? restore_fpregs_from_fpstate+0x3c/0x90
> ? switch_fpu_return+0x5b/0xe0
> ? do_syscall_64+0x204/0x2f0
> ? do_syscall_64+0x204/0x2f0
> ? do_syscall_64+0x204/0x2f0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> </TASK>
> [...]
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> Instead use the skb destructor_arg pointer along with pointer tagging.
> As pointers are always aligned to 8B, use the bottom bit to indicate
> whether this is a single address or an allocated struct containing several
> addresses.
>
> Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
> Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Could you also post a patch on top of net-next as it has diverged from
the net tree?
Thanks,
Jason
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number
2025-11-24 23:41 ` Jason Xing
@ 2025-11-25 11:40 ` Fernando Fernandez Mancera
2025-11-25 12:11 ` Jason Xing
0 siblings, 1 reply; 12+ messages in thread
From: Fernando Fernandez Mancera @ 2025-11-25 11:40 UTC (permalink / raw)
To: Jason Xing
Cc: netdev, csmate, maciej.fijalkowski, bpf, davem, edumazet, kuba,
pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson
On 11/25/25 12:41 AM, Jason Xing wrote:
> On Tue, Nov 25, 2025 at 1:14 AM Fernando Fernandez Mancera
> <fmancera@suse.de> wrote:
>>
>> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
>> production"), the descriptor number is stored in skb control block and
>> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
>> pool's completion queue.
>>
>> skb control block shouldn't be used for this purpose as after transmit
>> xsk doesn't have control over it and other subsystems could use it. This
>> leads to the following kernel panic due to a NULL pointer dereference.
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> #PF: supervisor read access in kernel mode
>> #PF: error_code(0x0000) - not-present page
>> PGD 0 P4D 0
>> Oops: Oops: 0000 [#1] SMP NOPTI
>> CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
>> RIP: 0010:xsk_destruct_skb+0xd0/0x180
>> [...]
>> Call Trace:
>> <IRQ>
>> ? napi_complete_done+0x7a/0x1a0
>> ip_rcv_core+0x1bb/0x340
>> ip_rcv+0x30/0x1f0
>> __netif_receive_skb_one_core+0x85/0xa0
>> process_backlog+0x87/0x130
>> __napi_poll+0x28/0x180
>> net_rx_action+0x339/0x420
>> handle_softirqs+0xdc/0x320
>> ? handle_edge_irq+0x90/0x1e0
>> do_softirq.part.0+0x3b/0x60
>> </IRQ>
>> <TASK>
>> __local_bh_enable_ip+0x60/0x70
>> __dev_direct_xmit+0x14e/0x1f0
>> __xsk_generic_xmit+0x482/0xb70
>> ? __remove_hrtimer+0x41/0xa0
>> ? __xsk_generic_xmit+0x51/0xb70
>> ? _raw_spin_unlock_irqrestore+0xe/0x40
>> xsk_sendmsg+0xda/0x1c0
>> __sys_sendto+0x1ee/0x200
>> __x64_sys_sendto+0x24/0x30
>> do_syscall_64+0x84/0x2f0
>> ? __pfx_pollwake+0x10/0x10
>> ? __rseq_handle_notify_resume+0xad/0x4c0
>> ? restore_fpregs_from_fpstate+0x3c/0x90
>> ? switch_fpu_return+0x5b/0xe0
>> ? do_syscall_64+0x204/0x2f0
>> ? do_syscall_64+0x204/0x2f0
>> ? do_syscall_64+0x204/0x2f0
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> </TASK>
>> [...]
>> Kernel panic - not syncing: Fatal exception in interrupt
>> Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>
>> Instead use the skb destructor_arg pointer along with pointer tagging.
>> As pointers are always aligned to 8B, use the bottom bit to indicate
>> whether this is a single address or an allocated struct containing several
>> addresses.
>>
>> Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
>> Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
>> Suggested-by: Jakub Kicinski <kuba@kernel.org>
>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>
> Could you also post a patch on top of net-next as it has diverged from
> the net tree?
>
I think that is handled by the maintainers when merging the branches. A
repost would be wrong because linux-next.git and linux.git would have a
different variant of the same commit.
Please let me know if I am wrong here.
Thanks,
Fernando.
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number
2025-11-25 11:40 ` Fernando Fernandez Mancera
@ 2025-11-25 12:11 ` Jason Xing
2025-11-25 16:31 ` Maciej Fijalkowski
0 siblings, 1 reply; 12+ messages in thread
From: Jason Xing @ 2025-11-25 12:11 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: netdev, csmate, maciej.fijalkowski, bpf, davem, edumazet, kuba,
pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson
On Tue, Nov 25, 2025 at 7:40 PM Fernando Fernandez Mancera
<fmancera@suse.de> wrote:
>
> On 11/25/25 12:41 AM, Jason Xing wrote:
> > On Tue, Nov 25, 2025 at 1:14 AM Fernando Fernandez Mancera
> > <fmancera@suse.de> wrote:
> >>
> >> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> >> production"), the descriptor number is stored in skb control block and
> >> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
> >> pool's completion queue.
> >>
> >> skb control block shouldn't be used for this purpose as after transmit
> >> xsk doesn't have control over it and other subsystems could use it. This
> >> leads to the following kernel panic due to a NULL pointer dereference.
> >>
> >> BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> #PF: supervisor read access in kernel mode
> >> #PF: error_code(0x0000) - not-present page
> >> PGD 0 P4D 0
> >> Oops: Oops: 0000 [#1] SMP NOPTI
> >> CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> >> RIP: 0010:xsk_destruct_skb+0xd0/0x180
> >> [...]
> >> Call Trace:
> >> <IRQ>
> >> ? napi_complete_done+0x7a/0x1a0
> >> ip_rcv_core+0x1bb/0x340
> >> ip_rcv+0x30/0x1f0
> >> __netif_receive_skb_one_core+0x85/0xa0
> >> process_backlog+0x87/0x130
> >> __napi_poll+0x28/0x180
> >> net_rx_action+0x339/0x420
> >> handle_softirqs+0xdc/0x320
> >> ? handle_edge_irq+0x90/0x1e0
> >> do_softirq.part.0+0x3b/0x60
> >> </IRQ>
> >> <TASK>
> >> __local_bh_enable_ip+0x60/0x70
> >> __dev_direct_xmit+0x14e/0x1f0
> >> __xsk_generic_xmit+0x482/0xb70
> >> ? __remove_hrtimer+0x41/0xa0
> >> ? __xsk_generic_xmit+0x51/0xb70
> >> ? _raw_spin_unlock_irqrestore+0xe/0x40
> >> xsk_sendmsg+0xda/0x1c0
> >> __sys_sendto+0x1ee/0x200
> >> __x64_sys_sendto+0x24/0x30
> >> do_syscall_64+0x84/0x2f0
> >> ? __pfx_pollwake+0x10/0x10
> >> ? __rseq_handle_notify_resume+0xad/0x4c0
> >> ? restore_fpregs_from_fpstate+0x3c/0x90
> >> ? switch_fpu_return+0x5b/0xe0
> >> ? do_syscall_64+0x204/0x2f0
> >> ? do_syscall_64+0x204/0x2f0
> >> ? do_syscall_64+0x204/0x2f0
> >> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >> </TASK>
> >> [...]
> >> Kernel panic - not syncing: Fatal exception in interrupt
> >> Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> >>
> >> Instead use the skb destructor_arg pointer along with pointer tagging.
> >> As pointers are always aligned to 8B, use the bottom bit to indicate
> >> whether this is a single address or an allocated struct containing several
> >> addresses.
> >>
> >> Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
> >> Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
> >> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> >> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> >
> > Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> >
> > Could you also post a patch on top of net-next as it has diverged from
> > the net tree?
> >
>
> I think that is handled by maintainers when merging the branches. A
> repost would be wrong because linux-next.git and linux.git will have a
> different variant of the same commit..
But this patch cannot be applied cleanly in the net-next tree...
>
> Please, let me know if I am wrong here.
I'm not quite sure either.
Thanks,
Jason
>
> Thanks,
> Fernando.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number
2025-11-25 12:11 ` Jason Xing
@ 2025-11-25 16:31 ` Maciej Fijalkowski
2025-11-26 1:14 ` Jason Xing
0 siblings, 1 reply; 12+ messages in thread
From: Maciej Fijalkowski @ 2025-11-25 16:31 UTC (permalink / raw)
To: Jason Xing
Cc: Fernando Fernandez Mancera, netdev, csmate, bpf, davem, edumazet,
kuba, pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson
On Tue, Nov 25, 2025 at 08:11:37PM +0800, Jason Xing wrote:
> On Tue, Nov 25, 2025 at 7:40 PM Fernando Fernandez Mancera
> <fmancera@suse.de> wrote:
> >
> > On 11/25/25 12:41 AM, Jason Xing wrote:
> > > On Tue, Nov 25, 2025 at 1:14 AM Fernando Fernandez Mancera
> > > <fmancera@suse.de> wrote:
> > >>
> > >> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> > >> production"), the descriptor number is stored in skb control block and
> > >> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
> > >> pool's completion queue.
> > >>
> > >> skb control block shouldn't be used for this purpose as after transmit
> > >> xsk doesn't have control over it and other subsystems could use it. This
> > >> leads to the following kernel panic due to a NULL pointer dereference.
> > >>
> > >> BUG: kernel NULL pointer dereference, address: 0000000000000000
> > >> #PF: supervisor read access in kernel mode
> > >> #PF: error_code(0x0000) - not-present page
> > >> PGD 0 P4D 0
> > >> Oops: Oops: 0000 [#1] SMP NOPTI
> > >> CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
> > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> > >> RIP: 0010:xsk_destruct_skb+0xd0/0x180
> > >> [...]
> > >> Call Trace:
> > >> <IRQ>
> > >> ? napi_complete_done+0x7a/0x1a0
> > >> ip_rcv_core+0x1bb/0x340
> > >> ip_rcv+0x30/0x1f0
> > >> __netif_receive_skb_one_core+0x85/0xa0
> > >> process_backlog+0x87/0x130
> > >> __napi_poll+0x28/0x180
> > >> net_rx_action+0x339/0x420
> > >> handle_softirqs+0xdc/0x320
> > >> ? handle_edge_irq+0x90/0x1e0
> > >> do_softirq.part.0+0x3b/0x60
> > >> </IRQ>
> > >> <TASK>
> > >> __local_bh_enable_ip+0x60/0x70
> > >> __dev_direct_xmit+0x14e/0x1f0
> > >> __xsk_generic_xmit+0x482/0xb70
> > >> ? __remove_hrtimer+0x41/0xa0
> > >> ? __xsk_generic_xmit+0x51/0xb70
> > >> ? _raw_spin_unlock_irqrestore+0xe/0x40
> > >> xsk_sendmsg+0xda/0x1c0
> > >> __sys_sendto+0x1ee/0x200
> > >> __x64_sys_sendto+0x24/0x30
> > >> do_syscall_64+0x84/0x2f0
> > >> ? __pfx_pollwake+0x10/0x10
> > >> ? __rseq_handle_notify_resume+0xad/0x4c0
> > >> ? restore_fpregs_from_fpstate+0x3c/0x90
> > >> ? switch_fpu_return+0x5b/0xe0
> > >> ? do_syscall_64+0x204/0x2f0
> > >> ? do_syscall_64+0x204/0x2f0
> > >> ? do_syscall_64+0x204/0x2f0
> > >> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >> </TASK>
> > >> [...]
> > >> Kernel panic - not syncing: Fatal exception in interrupt
> > >> Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > >>
> > >> Instead use the skb destructor_arg pointer along with pointer tagging.
> > >> As pointers are always aligned to 8B, use the bottom bit to indicate
> > >> whether this is a single address or an allocated struct containing several
> > >> addresses.
> > >>
> > >> Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
> > >> Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
> > >> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> > >> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> > >
> > > Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> > >
> > > Could you also post a patch on top of net-next as it has diverged from
> > > the net tree?
> > >
> >
> > I think that is handled by maintainers when merging the branches. A
> > repost would be wrong because linux-next.git and linux.git will have a
> > different variant of the same commit..
>
> But this patch cannot be applied cleanly in the net-next tree...
What we care about here is that it applies to net, as that's the tree
this patch has been posted to.
>
> >
> > Please, let me know if I am wrong here.
>
> I'm not quite sure either.
>
> Thanks,
> Jason
>
> >
> > Thanks,
> > Fernando.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number
2025-11-25 16:31 ` Maciej Fijalkowski
@ 2025-11-26 1:14 ` Jason Xing
2025-11-26 9:15 ` Fernando Fernandez Mancera
0 siblings, 1 reply; 12+ messages in thread
From: Jason Xing @ 2025-11-26 1:14 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Fernando Fernandez Mancera, netdev, csmate, bpf, davem, edumazet,
kuba, pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson
On Wed, Nov 26, 2025 at 12:31 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Tue, Nov 25, 2025 at 08:11:37PM +0800, Jason Xing wrote:
> > On Tue, Nov 25, 2025 at 7:40 PM Fernando Fernandez Mancera
> > <fmancera@suse.de> wrote:
> > >
> > > On 11/25/25 12:41 AM, Jason Xing wrote:
> > > > On Tue, Nov 25, 2025 at 1:14 AM Fernando Fernandez Mancera
> > > > <fmancera@suse.de> wrote:
> > > >>
> > > >> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> > > >> production"), the descriptor number is stored in skb control block and
> > > >> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
> > > >> pool's completion queue.
> > > >>
> > > >> skb control block shouldn't be used for this purpose as after transmit
> > > >> xsk doesn't have control over it and other subsystems could use it. This
> > > >> leads to the following kernel panic due to a NULL pointer dereference.
> > > >>
> > > >> BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > >> #PF: supervisor read access in kernel mode
> > > >> #PF: error_code(0x0000) - not-present page
> > > >> PGD 0 P4D 0
> > > >> Oops: Oops: 0000 [#1] SMP NOPTI
> > > >> CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
> > > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> > > >> RIP: 0010:xsk_destruct_skb+0xd0/0x180
> > > >> [...]
> > > >> Call Trace:
> > > >> <IRQ>
> > > >> ? napi_complete_done+0x7a/0x1a0
> > > >> ip_rcv_core+0x1bb/0x340
> > > >> ip_rcv+0x30/0x1f0
> > > >> __netif_receive_skb_one_core+0x85/0xa0
> > > >> process_backlog+0x87/0x130
> > > >> __napi_poll+0x28/0x180
> > > >> net_rx_action+0x339/0x420
> > > >> handle_softirqs+0xdc/0x320
> > > >> ? handle_edge_irq+0x90/0x1e0
> > > >> do_softirq.part.0+0x3b/0x60
> > > >> </IRQ>
> > > >> <TASK>
> > > >> __local_bh_enable_ip+0x60/0x70
> > > >> __dev_direct_xmit+0x14e/0x1f0
> > > >> __xsk_generic_xmit+0x482/0xb70
> > > >> ? __remove_hrtimer+0x41/0xa0
> > > >> ? __xsk_generic_xmit+0x51/0xb70
> > > >> ? _raw_spin_unlock_irqrestore+0xe/0x40
> > > >> xsk_sendmsg+0xda/0x1c0
> > > >> __sys_sendto+0x1ee/0x200
> > > >> __x64_sys_sendto+0x24/0x30
> > > >> do_syscall_64+0x84/0x2f0
> > > >> ? __pfx_pollwake+0x10/0x10
> > > >> ? __rseq_handle_notify_resume+0xad/0x4c0
> > > >> ? restore_fpregs_from_fpstate+0x3c/0x90
> > > >> ? switch_fpu_return+0x5b/0xe0
> > > >> ? do_syscall_64+0x204/0x2f0
> > > >> ? do_syscall_64+0x204/0x2f0
> > > >> ? do_syscall_64+0x204/0x2f0
> > > >> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > >> </TASK>
> > > >> [...]
> > > >> Kernel panic - not syncing: Fatal exception in interrupt
> > > >> Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > >>
> > > >> Instead use the skb destructor_arg pointer along with pointer tagging.
> > > >> As pointers are always aligned to 8B, use the bottom bit to indicate
> > > >> whether this is a single address or an allocated struct containing several
> > > >> addresses.
> > > >>
> > > >> Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
> > > >> Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
> > > >> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> > > >> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> > > >
> > > > Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> > > >
> > > > Could you also post a patch on top of net-next as it has diverged from
> > > > the net tree?
> > > >
> > >
> > > I think that is handled by maintainers when merging the branches. A
> > > repost would be wrong because linux-next.git and linux.git will have a
> > > different variant of the same commit..
> >
> > But this patch cannot be applied cleanly in the net-next tree...
>
> What we care about here is that it applies to net, as that's the tree
> this patch has been posted to.
It sounds like I can post my approach without this patch on net-next,
right? I have no idea how long I should keep waiting :S
To be clear, what I meant was to ask Fernando to post a new rebased
patch targeting net-next. If the patch doesn't need to land on
net-next, I will post it as soon as possible.
Thanks,
Jason
> >
> > >
> > > Please, let me know if I am wrong here.
> >
> > I'm not quite sure either.
> >
> > Thanks,
> > Jason
> >
> > >
> > > Thanks,
> > > Fernando.
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number
2025-11-24 17:14 [PATCH net v6] xsk: avoid data corruption on cq descriptor number Fernando Fernandez Mancera
2025-11-24 23:03 ` Maciej Fijalkowski
2025-11-24 23:41 ` Jason Xing
@ 2025-11-26 4:00 ` patchwork-bot+netdevbpf
2025-11-26 11:42 ` [PATCH net v6] xsk: avoid data corruption on cq descriptor number: manual merge Matthieu Baerts
3 siblings, 0 replies; 12+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-11-26 4:00 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: netdev, csmate, kerneljasonxing, maciej.fijalkowski, bpf, davem,
edumazet, kuba, pabeni, horms, sdf, hawk, daniel, ast,
john.fastabend, magnus.karlsson
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 24 Nov 2025 18:14:09 +0100 you wrote:
> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> production"), the descriptor number is stored in skb control block and
> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
> pool's completion queue.
>
> skb control block shouldn't be used for this purpose as after transmit
> xsk doesn't have control over it and other subsystems could use it. This
> leads to the following kernel panic due to a NULL pointer dereference.
>
> [...]
Here is the summary with links:
- [net,v6] xsk: avoid data corruption on cq descriptor number
https://git.kernel.org/netdev/net/c/0ebc27a4c67d
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number
2025-11-26 1:14 ` Jason Xing
@ 2025-11-26 9:15 ` Fernando Fernandez Mancera
2025-11-26 11:00 ` Jason Xing
0 siblings, 1 reply; 12+ messages in thread
From: Fernando Fernandez Mancera @ 2025-11-26 9:15 UTC (permalink / raw)
To: Jason Xing, Maciej Fijalkowski
Cc: netdev, csmate, bpf, davem, edumazet, kuba, pabeni, horms, sdf,
hawk, daniel, ast, john.fastabend, magnus.karlsson
On 11/26/25 2:14 AM, Jason Xing wrote:
> On Wed, Nov 26, 2025 at 12:31 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
>>
>> On Tue, Nov 25, 2025 at 08:11:37PM +0800, Jason Xing wrote:
>>> On Tue, Nov 25, 2025 at 7:40 PM Fernando Fernandez Mancera
>>> <fmancera@suse.de> wrote:
>>>>
>>>> On 11/25/25 12:41 AM, Jason Xing wrote:
>>>>> On Tue, Nov 25, 2025 at 1:14 AM Fernando Fernandez Mancera
>>>>> <fmancera@suse.de> wrote:
>>>>>>
>>>>>> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
>>>>>> production"), the descriptor number is stored in skb control block and
>>>>>> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
>>>>>> pool's completion queue.
>>>>>>
>>>>>> skb control block shouldn't be used for this purpose as after transmit
>>>>>> xsk doesn't have control over it and other subsystems could use it. This
>>>>>> leads to the following kernel panic due to a NULL pointer dereference.
>>>>>>
>>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>>> #PF: supervisor read access in kernel mode
>>>>>> #PF: error_code(0x0000) - not-present page
>>>>>> PGD 0 P4D 0
>>>>>> Oops: Oops: 0000 [#1] SMP NOPTI
>>>>>> CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
>>>>>> RIP: 0010:xsk_destruct_skb+0xd0/0x180
>>>>>> [...]
>>>>>> Call Trace:
>>>>>> <IRQ>
>>>>>> ? napi_complete_done+0x7a/0x1a0
>>>>>> ip_rcv_core+0x1bb/0x340
>>>>>> ip_rcv+0x30/0x1f0
>>>>>> __netif_receive_skb_one_core+0x85/0xa0
>>>>>> process_backlog+0x87/0x130
>>>>>> __napi_poll+0x28/0x180
>>>>>> net_rx_action+0x339/0x420
>>>>>> handle_softirqs+0xdc/0x320
>>>>>> ? handle_edge_irq+0x90/0x1e0
>>>>>> do_softirq.part.0+0x3b/0x60
>>>>>> </IRQ>
>>>>>> <TASK>
>>>>>> __local_bh_enable_ip+0x60/0x70
>>>>>> __dev_direct_xmit+0x14e/0x1f0
>>>>>> __xsk_generic_xmit+0x482/0xb70
>>>>>> ? __remove_hrtimer+0x41/0xa0
>>>>>> ? __xsk_generic_xmit+0x51/0xb70
>>>>>> ? _raw_spin_unlock_irqrestore+0xe/0x40
>>>>>> xsk_sendmsg+0xda/0x1c0
>>>>>> __sys_sendto+0x1ee/0x200
>>>>>> __x64_sys_sendto+0x24/0x30
>>>>>> do_syscall_64+0x84/0x2f0
>>>>>> ? __pfx_pollwake+0x10/0x10
>>>>>> ? __rseq_handle_notify_resume+0xad/0x4c0
>>>>>> ? restore_fpregs_from_fpstate+0x3c/0x90
>>>>>> ? switch_fpu_return+0x5b/0xe0
>>>>>> ? do_syscall_64+0x204/0x2f0
>>>>>> ? do_syscall_64+0x204/0x2f0
>>>>>> ? do_syscall_64+0x204/0x2f0
>>>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>> </TASK>
>>>>>> [...]
>>>>>> Kernel panic - not syncing: Fatal exception in interrupt
>>>>>> Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>>>>>
>>>>>> Instead use the skb destructor_arg pointer along with pointer tagging.
>>>>>> As pointers are always aligned to 8B, use the bottom bit to indicate
>>>>>> whether this is a single address or an allocated struct containing several
>>>>>> addresses.
>>>>>>
>>>>>> Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
>>>>>> Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
>>>>>> Suggested-by: Jakub Kicinski <kuba@kernel.org>
>>>>>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>>>>>
>>>>> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>>>>>
>>>>> Could you also post a patch on top of net-next as it has diverged from
>>>>> the net tree?
>>>>>
>>>>
>>>> I think that is handled by maintainers when merging the branches. A
>>>> repost would be wrong because linux-next.git and linux.git will have a
>>>> different variant of the same commit..
>>>
>>> But this patch cannot be applied cleanly in the net-next tree...
>>
> >> What we care about here is that it applies to net, as that's the tree
> >> this patch has been posted to.
>
> It sounds like I can post my approach without this patch on net-next,
> right? I have no idea how long I should keep waiting :S
>
> To be clear, what I meant was to ask Fernando to post a new rebased
> patch targeting net-next. If the patch doesn't need to land on
> net-next, I will post it as soon as possible.
>
My patch landed in the net tree and, probably soon, net tree changes are
going to be merged into the net-next tree. If there are conflicts when
merging the patch, the maintainer will either ask us or solve them.
That was my understanding of the workflow.
Thanks,
Fernando.
> Thanks,
> Jason
>
>>>
>>>>
>>>> Please, let me know if I am wrong here.
>>>
>>> I'm not quite sure either.
>>>
>>> Thanks,
>>> Jason
>>>
>>>>
>>>> Thanks,
>>>> Fernando.
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number
2025-11-26 9:15 ` Fernando Fernandez Mancera
@ 2025-11-26 11:00 ` Jason Xing
0 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2025-11-26 11:00 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: Maciej Fijalkowski, netdev, csmate, bpf, davem, edumazet, kuba,
pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson
On Wed, Nov 26, 2025 at 5:15 PM Fernando Fernandez Mancera
<fmancera@suse.de> wrote:
>
> On 11/26/25 2:14 AM, Jason Xing wrote:
> > On Wed, Nov 26, 2025 at 12:31 AM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> >>
> >> On Tue, Nov 25, 2025 at 08:11:37PM +0800, Jason Xing wrote:
> >>> On Tue, Nov 25, 2025 at 7:40 PM Fernando Fernandez Mancera
> >>> <fmancera@suse.de> wrote:
> >>>>
> >>>> On 11/25/25 12:41 AM, Jason Xing wrote:
> >>>>> On Tue, Nov 25, 2025 at 1:14 AM Fernando Fernandez Mancera
> >>>>> <fmancera@suse.de> wrote:
> >>>>>>
> >>>>>> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> >>>>>> production"), the descriptor number is stored in skb control block and
> >>>>>> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
> >>>>>> pool's completion queue.
> >>>>>>
> >>>>>> skb control block shouldn't be used for this purpose as after transmit
> >>>>>> xsk doesn't have control over it and other subsystems could use it. This
> >>>>>> leads to the following kernel panic due to a NULL pointer dereference.
> >>>>>>
> >>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000000
> >>>>>> #PF: supervisor read access in kernel mode
> >>>>>> #PF: error_code(0x0000) - not-present page
> >>>>>> PGD 0 P4D 0
> >>>>>> Oops: Oops: 0000 [#1] SMP NOPTI
> >>>>>> CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
> >>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> >>>>>> RIP: 0010:xsk_destruct_skb+0xd0/0x180
> >>>>>> [...]
> >>>>>> Call Trace:
> >>>>>> <IRQ>
> >>>>>> ? napi_complete_done+0x7a/0x1a0
> >>>>>> ip_rcv_core+0x1bb/0x340
> >>>>>> ip_rcv+0x30/0x1f0
> >>>>>> __netif_receive_skb_one_core+0x85/0xa0
> >>>>>> process_backlog+0x87/0x130
> >>>>>> __napi_poll+0x28/0x180
> >>>>>> net_rx_action+0x339/0x420
> >>>>>> handle_softirqs+0xdc/0x320
> >>>>>> ? handle_edge_irq+0x90/0x1e0
> >>>>>> do_softirq.part.0+0x3b/0x60
> >>>>>> </IRQ>
> >>>>>> <TASK>
> >>>>>> __local_bh_enable_ip+0x60/0x70
> >>>>>> __dev_direct_xmit+0x14e/0x1f0
> >>>>>> __xsk_generic_xmit+0x482/0xb70
> >>>>>> ? __remove_hrtimer+0x41/0xa0
> >>>>>> ? __xsk_generic_xmit+0x51/0xb70
> >>>>>> ? _raw_spin_unlock_irqrestore+0xe/0x40
> >>>>>> xsk_sendmsg+0xda/0x1c0
> >>>>>> __sys_sendto+0x1ee/0x200
> >>>>>> __x64_sys_sendto+0x24/0x30
> >>>>>> do_syscall_64+0x84/0x2f0
> >>>>>> ? __pfx_pollwake+0x10/0x10
> >>>>>> ? __rseq_handle_notify_resume+0xad/0x4c0
> >>>>>> ? restore_fpregs_from_fpstate+0x3c/0x90
> >>>>>> ? switch_fpu_return+0x5b/0xe0
> >>>>>> ? do_syscall_64+0x204/0x2f0
> >>>>>> ? do_syscall_64+0x204/0x2f0
> >>>>>> ? do_syscall_64+0x204/0x2f0
> >>>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >>>>>> </TASK>
> >>>>>> [...]
> >>>>>> Kernel panic - not syncing: Fatal exception in interrupt
> >>>>>> Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> >>>>>>
> >>>>>> Instead use the skb destructor_arg pointer along with pointer tagging.
> >>>>>> As pointers are always aligned to 8B, use the bottom bit to indicate
> >>>>>> whether this is a single address or an allocated struct containing several
> >>>>>> addresses.
> >>>>>>
> >>>>>> Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
> >>>>>> Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
> >>>>>> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> >>>>>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> >>>>>
> >>>>> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> >>>>>
> >>>>> Could you also post a patch on top of net-next as it has diverged from
> >>>>> the net tree?
> >>>>>
> >>>>
> >>>> I think that is handled by maintainers when merging the branches. A
> >>>> repost would be wrong because linux-next.git and linux.git will have a
> >>>> different variant of the same commit..
> >>>
> >>> But this patch cannot be applied cleanly in the net-next tree...
> >>
> >> What we care about here is that it applies to net, as that's the tree
> >> this patch has been posted to.
> >
> > It sounds like I can post my approach without this patch on net-next,
> > right? I have no idea how long I should keep waiting :S
> >
> > To be clear, what I meant was to ask Fernando to post a new rebased
> > patch targeting net-next. If the patch doesn't need to land on
> > net-next, I will post it as soon as possible.
> >
>
> My patch landed in the net tree and, probably soon, net tree changes are
> going to be merged into the net-next tree. If there are conflicts when
> merging the patch, the maintainer will either ask us or solve them.
Right, it's a normal routine. I will wait then.
Thanks,
Jason
>
> That was my understanding of the workflow.
>
> Thanks,
> Fernando.
>
> > Thanks,
> > Jason
> >
> >>>
> >>>>
> >>>> Please, let me know if I am wrong here.
> >>>
> >>> I'm not quite sure either.
> >>>
> >>> Thanks,
> >>> Jason
> >>>
> >>>>
> >>>> Thanks,
> >>>> Fernando.
>
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number: manual merge
2025-11-24 17:14 [PATCH net v6] xsk: avoid data corruption on cq descriptor number Fernando Fernandez Mancera
` (2 preceding siblings ...)
2025-11-26 4:00 ` patchwork-bot+netdevbpf
@ 2025-11-26 11:42 ` Matthieu Baerts
2025-11-26 12:05 ` Fernando Fernandez Mancera
3 siblings, 1 reply; 12+ messages in thread
From: Matthieu Baerts @ 2025-11-26 11:42 UTC (permalink / raw)
To: Fernando Fernandez Mancera, netdev
Cc: csmate, kerneljasonxing, maciej.fijalkowski, bpf, davem, edumazet,
kuba, pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson, Stephen Rothwell, Mark Brown
[-- Attachment #1: Type: text/plain, Size: 1813 bytes --]
Hello,
On 24/11/2025 18:14, Fernando Fernandez Mancera wrote:
> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> production"), the descriptor number is stored in skb control block and
> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
> pool's completion queue.
>
> skb control block shouldn't be used for this purpose as after transmit
> xsk doesn't have control over it and other subsystems could use it. This
> leads to the following kernel panic due to a NULL pointer dereference.
FYI, and as predicted by Jason, we got a small conflict when merging
'net' into 'net-next' in the MPTCP tree due to this patch applied in 'net':
0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
and this one from 'net-next':
8da7bea7db69 ("xsk: add indirect call for xsk_destruct_skb")
----- Generic Message -----
It is best to avoid conflicts between the 'net' and 'net-next' trees,
but if they cannot be avoided when preparing patches, a note about how
to fix them is much appreciated.
The conflict has been resolved on our side [1] and the resolution we
suggest is attached to this email. Please report any issues linked to
this conflict resolution as it might be used by others. If you worked on
the mentioned patches, don't hesitate to ACK this conflict resolution.
---------------------------
Regarding this conflict, the patch from 'net' removed two functions
above one that has been applied in 'net-next'. I then combined the two
modifications by removing the two functions and keeping the new
attributes set to xsk_destruct_skb().
Rerere cache is available in [2].
Cheers,
Matt
1: https://github.com/multipath-tcp/mptcp_net-next/commit/1caa6b15d784
2: https://github.com/multipath-tcp/mptcp-upstream-rr-cache/commit/265e1
--
Sponsored by the NGI0 Core fund.
[-- Attachment #2: 1caa6b15d784769d23637d9c5dae18ddc4bd8b39.patch --]
[-- Type: text/x-patch, Size: 2275 bytes --]
diff --cc net/xdp/xsk.c
index bcfd400e9cf8,69bbcca8ac75..f093c3453f64
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@@ -560,50 -591,43 +590,42 @@@ static u32 xsk_get_num_desc(struct sk_b
static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
struct sk_buff *skb)
{
- struct xsk_addr_node *pos, *tmp;
+ u32 num_descs = xsk_get_num_desc(skb);
+ struct xsk_addrs *xsk_addr;
u32 descs_processed = 0;
unsigned long flags;
- u32 idx;
+ u32 idx, i;
- spin_lock_irqsave(&pool->cq_lock, flags);
+ spin_lock_irqsave(&pool->cq_prod_lock, flags);
idx = xskq_get_prod(pool->cq);
- xskq_prod_write_addr(pool->cq, idx,
- (u64)(uintptr_t)skb_shinfo(skb)->destructor_arg);
- descs_processed++;
+ if (unlikely(num_descs > 1)) {
+ xsk_addr = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
- if (unlikely(XSKCB(skb)->num_descs > 1)) {
- list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
+ for (i = 0; i < num_descs; i++) {
xskq_prod_write_addr(pool->cq, idx + descs_processed,
- pos->addr);
+ xsk_addr->addrs[i]);
descs_processed++;
- list_del(&pos->addr_node);
- kmem_cache_free(xsk_tx_generic_cache, pos);
}
+ kmem_cache_free(xsk_tx_generic_cache, xsk_addr);
+ } else {
+ xskq_prod_write_addr(pool->cq, idx,
+ xsk_skb_destructor_get_addr(skb));
+ descs_processed++;
}
xskq_prod_submit_n(pool->cq, descs_processed);
- spin_unlock_irqrestore(&pool->cq_lock, flags);
+ spin_unlock_irqrestore(&pool->cq_prod_lock, flags);
}
static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
{
- unsigned long flags;
-
- spin_lock_irqsave(&pool->cq_lock, flags);
+ spin_lock(&pool->cq_cached_prod_lock);
xskq_prod_cancel_n(pool->cq, n);
- spin_unlock_irqrestore(&pool->cq_lock, flags);
+ spin_unlock(&pool->cq_cached_prod_lock);
}
- static void xsk_inc_num_desc(struct sk_buff *skb)
- {
- XSKCB(skb)->num_descs++;
- }
-
- static u32 xsk_get_num_desc(struct sk_buff *skb)
- {
- return XSKCB(skb)->num_descs;
- }
-
-static void xsk_destruct_skb(struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE
+void xsk_destruct_skb(struct sk_buff *skb)
{
struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
* Re: [PATCH net v6] xsk: avoid data corruption on cq descriptor number: manual merge
2025-11-26 11:42 ` [PATCH net v6] xsk: avoid data corruption on cq descriptor number: manual merge Matthieu Baerts
@ 2025-11-26 12:05 ` Fernando Fernandez Mancera
0 siblings, 0 replies; 12+ messages in thread
From: Fernando Fernandez Mancera @ 2025-11-26 12:05 UTC (permalink / raw)
To: Matthieu Baerts, netdev
Cc: csmate, kerneljasonxing, maciej.fijalkowski, bpf, davem, edumazet,
kuba, pabeni, horms, sdf, hawk, daniel, ast, john.fastabend,
magnus.karlsson, Stephen Rothwell, Mark Brown
On 11/26/25 12:42 PM, Matthieu Baerts wrote:
> Hello,
>
> On 24/11/2025 18:14, Fernando Fernandez Mancera wrote:
>> Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
>> production"), the descriptor number is stored in skb control block and
>> xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
>> pool's completion queue.
>>
>> skb control block shouldn't be used for this purpose as after transmit
>> xsk doesn't have control over it and other subsystems could use it. This
>> leads to the following kernel panic due to a NULL pointer dereference.
>
> FYI, and as predicted by Jason, we got a small conflict when merging
> 'net' into 'net-next' in the MPTCP tree due to this patch applied in 'net':
>
> 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
>
> and this one from 'net-next':
>
> 8da7bea7db69 ("xsk: add indirect call for xsk_destruct_skb")
>
> ----- Generic Message -----
> It is best to avoid conflicts between the 'net' and 'net-next' trees,
> but if they cannot be avoided when preparing patches, a note about how
> to fix them is much appreciated.
>
> The conflict has been resolved on our side [1] and the resolution we
> suggest is attached to this email. Please report any issues linked to
> this conflict resolution as it might be used by others. If you worked on
> the mentioned patches, don't hesitate to ACK this conflict resolution.
> ---------------------------
>
Noted, thank you!
> Regarding this conflict, the patch from 'net' removed two functions
> above one that has been applied in 'net-next'. I then combined the two
> modifications by removing the two functions and keeping the new
> attributes set to xsk_destruct_skb().
>
> Rerere cache is available in [2].
>
Acked-by: Fernando Fernandez Mancera <fmancera@suse.de>
> Cheers,
> Matt
>
> 1: https://github.com/multipath-tcp/mptcp_net-next/commit/1caa6b15d784
> 2: https://github.com/multipath-tcp/mptcp-upstream-rr-cache/commit/265e1
end of thread, other threads:[~2025-11-26 12:05 UTC | newest]
Thread overview: 12+ messages
2025-11-24 17:14 [PATCH net v6] xsk: avoid data corruption on cq descriptor number Fernando Fernandez Mancera
2025-11-24 23:03 ` Maciej Fijalkowski
2025-11-24 23:41 ` Jason Xing
2025-11-25 11:40 ` Fernando Fernandez Mancera
2025-11-25 12:11 ` Jason Xing
2025-11-25 16:31 ` Maciej Fijalkowski
2025-11-26 1:14 ` Jason Xing
2025-11-26 9:15 ` Fernando Fernandez Mancera
2025-11-26 11:00 ` Jason Xing
2025-11-26 4:00 ` patchwork-bot+netdevbpf
2025-11-26 11:42 ` [PATCH net v6] xsk: avoid data corruption on cq descriptor number: manual merge Matthieu Baerts
2025-11-26 12:05 ` Fernando Fernandez Mancera