bpf.vger.kernel.org archive mirror
* [BUG] mlx5_core memory management issue
@ 2025-07-03 15:49 Chris Arges
  2025-07-04 12:37 ` Dragos Tatulea
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Arges @ 2025-07-03 15:49 UTC (permalink / raw)
  To: netdev, bpf
  Cc: kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

When running iperf through a set of XDP programs we were able to crash
machines with NICs using the mlx5_core driver. We were able to confirm
that other NICs/drivers did not exhibit the same problem, and suspect
this could be a memory management issue in the driver code.
Specifically we found a WARNING at include/net/page_pool/helpers.h:277
mlx5e_page_release_fragmented.isra. We are able to demonstrate this
issue in production using hardware, but cannot easily bisect because
we don’t have a simple reproducer. I wanted to share stack traces in
order to help us further debug and understand if anyone else has run
into this issue. We are currently working on getting more crashdumps
and doing further analysis.


The test setup looks like the following:
  ┌─────┐
  │mlx5 │
  │NIC  │
  └──┬──┘
     │xdp ebpf program (does encap and XDP_TX)
     │
     ▼
  ┌──────────────────────┐
  │xdp.frags             │
  │                      │
  └──┬───────────────────┘
     │tailcall
     │BPF_REDIRECT_MAP (using CPUMAP bpf type)
     ▼
  ┌──────────────────────┐
  │xdp.frags/cpumap      │
  │                      │
  └──┬───────────────────┘
     │BPF_REDIRECT to veth (*potential trigger for issue)
     │
     ▼
  ┌──────┐
  │veth  │
  │      │
  └──┬───┘
     │
     │
     ▼

Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
BPF_REDIRECT_MAP into an xdp.frags/cpumap. For our reproducer we can
choose a random valid CPU to reproduce the issue. Once that packet
reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
to a veth device which has an XDP program which redirects to an
XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
veth device that we noticed this issue.
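
For reference, here is a minimal sketch of what this program chain can look
like (illustration only, not our actual production programs: the encap/XDP_TX
step and the tail call are omitted, and the CPU id, veth ifindex and map
sizes are placeholders):

```
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 128);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsk_map SEC(".maps");

/* Filled in from userspace at load time; zero is just a placeholder. */
volatile const __u32 target_cpu = 0;
volatile const __u32 veth_ifindex = 0;

/* Attached to the mlx5 NIC: steer the frame to another CPU via the cpumap. */
SEC("xdp.frags")
int xdp_main(struct xdp_md *ctx)
{
	return bpf_redirect_map(&cpu_map, target_cpu, 0);
}

/* Runs on the cpumap CPU: redirect the frame to the veth device. */
SEC("xdp.frags/cpumap")
int xdp_cpumap_fwd(struct xdp_md *ctx)
{
	return bpf_redirect(veth_ifindex, 0);
}

/* Attached to the veth peer: hand the frame to an AF_XDP socket. */
SEC("xdp")
int xdp_veth_to_xsk(struct xdp_md *ctx)
{
	return bpf_redirect_map(&xsk_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```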

When running with 6.12.30 to 6.12.32 kernels we are able to see the
following KASAN use-after-free WARNINGs followed by a page fault which
crashes the machine. We have not been able to test earlier or later
kernels. I’ve tried to map symbols to lines of code for clarity.

------------[ cut here ]------------
WARNING: CPU: 157 PID: 0 at include/net/page_pool/helpers.h:277
mlx5e_page_release_fragmented.isra.0+0xf7/0x150 [mlx5_core]

mlx5e_page_release_fragmented.isra.0
(include/net/page_pool/helpers.h:277 (discriminator 1)
include/net/page_pool/helpers.h:292 (discriminator 1)
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:301 (discriminator 1))
mlx5_core

 ==================================================================
Modules linked in:
 BUG: KASAN: use-after-free in veth_xdp_rcv.constprop.0+0x9a6/0xc40 [veth]
 mptcp_diag
 Read of size 2 at addr ffff88b8c9eee008 by task napi/iconduit-g/681556

 CPU: 34 UID: 0 PID: 681556 Comm: napi/iconduit-g Kdump: loaded
Tainted: G        W  O       6.12.30-cloudflare-kasan-2025.5.26 #1
 Tainted: [W]=WARN, [O]=OOT_MODULE
 Hardware name: Lenovo HR355M-V3-G12/HR355M_V3_HPM, BIOS
HR355M_V3.G.031 02/17/2025
 Call Trace:
  <TASK>
 dump_stack_lvl (lib/dump_stack.c:122)
print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
? __pfx__raw_spin_lock_irqsave (kernel/locking/spinlock.c:161)
? veth_xdp_rcv.constprop.0 (include/net/xdp.h:323 drivers/net/veth.c:924) veth
kasan_report (mm/kasan/report.c:220 mm/kasan/report.c:603)
? veth_xdp_rcv.constprop.0 (include/net/xdp.h:323 drivers/net/veth.c:924) veth
veth_xdp_rcv.constprop.0 (include/net/xdp.h:323 drivers/net/veth.c:924) veth
? napi_threaded_poll_loop (net/core/dev.c:6377 net/core/dev.c:6363
net/core/dev.c:6967)
? __pfx_veth_xdp_rcv.constprop.0 (drivers/net/veth.c:899) veth
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
veth_poll (drivers/net/veth.c:981) veth
? update_load_avg (kernel/sched/fair.c:4531 kernel/sched/fair.c:4868)
? __pfx_veth_poll (drivers/net/veth.c:969) veth
? __pfx___perf_event_task_sched_out (kernel/events/core.c:3765)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? finish_task_switch.isra.0 (arch/x86/include/asm/irqflags.h:42
arch/x86/include/asm/irqflags.h:119 kernel/sched/sched.h:1527
kernel/sched/core.c:5086 kernel/sched/core.c:5204)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? __switch_to (arch/x86/include/asm/bitops.h:55
include/asm-generic/bitops/instrumented-atomic.h:29
include/linux/thread_info.h:89 include/linux/sched.h:1978
arch/x86/include/asm/fpu/sched.h:68 arch/x86/kernel/process_64.c:674)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? __schedule (kernel/sched/core.c:6592)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? __pfx_migrate_enable (kernel/sched/core.c:2338)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? napi_pp_put_page (arch/x86/include/asm/atomic64_64.h:79
(discriminator 5) include/linux/atomic/atomic-arch-fallback.h:2913
(discriminator 5) include/linux/atomic/atomic-long.h:331
(discriminator 5) include/linux/atomic/atomic-instrumented.h:3446
(discriminator 5) include/net/page_pool/helpers.h:276 (discriminator
5) include/net/page_pool/helpers.h:308 (discriminator 5)
include/net/page_pool/helpers.h:320 (discriminator 5)
include/net/page_pool/helpers.h:353 (discriminator 5)
net/core/skbuff.c:1040 (discriminator 5))
__napi_poll (net/core/dev.c:6837)
bpf_trampoline_6442548359+0x79/0x123
? __cond_resched (arch/x86/include/asm/preempt.h:84 (discriminator 13)
kernel/sched/core.c:6891 (discriminator 13) kernel/sched/core.c:7234
(discriminator 13))
__napi_poll (net/core/dev.c:6824)
napi_threaded_poll_loop (include/linux/netpoll.h:90 net/core/dev.c:6958)
? __pfx_napi_threaded_poll_loop (net/core/dev.c:6941)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? sysvec_call_function_single (arch/x86/include/asm/hardirq.h:78
(discriminator 2) arch/x86/kernel/smp.c:266 (discriminator 2))
? napi_threaded_poll (arch/x86/include/asm/bitops.h:206
arch/x86/include/asm/bitops.h:238
include/asm-generic/bitops/instrumented-non-atomic.h:142
net/core/dev.c:6926 net/core/dev.c:6983)
napi_threaded_poll (net/core/dev.c:6984)
? __pfx_napi_threaded_poll (net/core/dev.c:6980)
kthread (kernel/kthread.c:389)
? recalc_sigpending (arch/x86/include/asm/bitops.h:75
include/asm-generic/bitops/instrumented-atomic.h:42
include/linux/thread_info.h:94 kernel/signal.c:178)
? __pfx_kthread (kernel/kthread.c:342)
ret_from_fork (arch/x86/kernel/process.c:152)
? __pfx_kthread (kernel/kthread.c:342)
ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
  </TASK>

 xsk_diag
 The buggy address belongs to the physical page:
 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x38c9eee
 flags: 0x1effff800000000(node=7|zone=2|lastcpupid=0x1ffff)
 raw_diag
 raw: 01effff800000000 ffffea00e3075c48 ffffea00e3211648 0000000000000000
 raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 unix_diag
 Memory state around the buggy address:
  ffff88b8c9eedf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff88b8c9eedf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 >ffff88b8c9eee000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                       ^
 af_packet_diag
  ffff88b8c9eee080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff88b8c9eee100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ==================================================================
 netlink_diag
 Disabling lock debugging due to kernel taint
 nfnetlink_queue xt_TPROXY
 ==================================================================
 BUG: KASAN: use-after-free in veth_xdp_rcv.constprop.0
(include/net/xdp.h:182 include/net/xdp.h:325 drivers/net/veth.c:924)
veth

 nf_tproxy_ipv6
 Read of size 4 at addr ffff88b8c9eee024 by task napi/iconduit-g/681556

 CPU: 34 UID: 0 PID: 681556 Comm: napi/iconduit-g Kdump: loaded
Tainted: G    B   W  O       6.12.30-cloudflare-kasan-2025.5.26 #1
 Tainted: [B]=BAD_PAGE, [W]=WARN, [O]=OOT_MODULE
 Hardware name: Lenovo HR355M-V3-G12/HR355M_V3_HPM, BIOS
HR355M_V3.G.031 02/17/2025
Call Trace:
 <TASK>
dump_stack_lvl (lib/dump_stack.c:122)
print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
? __pfx__raw_spin_lock_irqsave (kernel/locking/spinlock.c:161)
? add_taint (include/linux/debug_locks.h:16 (discriminator 4)
kernel/panic.c:602 (discriminator 4))
? veth_xdp_rcv.constprop.0 (include/net/xdp.h:182
include/net/xdp.h:325 drivers/net/veth.c:924) veth
kasan_report (mm/kasan/report.c:220 mm/kasan/report.c:603)
? veth_xdp_rcv.constprop.0 (include/net/xdp.h:182
include/net/xdp.h:325 drivers/net/veth.c:924) veth
veth_xdp_rcv.constprop.0 (include/net/xdp.h:182 include/net/xdp.h:325
drivers/net/veth.c:924) veth
? napi_threaded_poll_loop (net/core/dev.c:6377 net/core/dev.c:6363
net/core/dev.c:6967)
? __pfx_veth_xdp_rcv.constprop.0 (drivers/net/veth.c:899) veth
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
veth_poll (drivers/net/veth.c:981) veth
? update_load_avg (kernel/sched/fair.c:4531 kernel/sched/fair.c:4868)
? __pfx_veth_poll (drivers/net/veth.c:969) veth
? __pfx___perf_event_task_sched_out (kernel/events/core.c:3765)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? finish_task_switch.isra.0 (arch/x86/include/asm/irqflags.h:42
arch/x86/include/asm/irqflags.h:119 kernel/sched/sched.h:1527
kernel/sched/core.c:5086 kernel/sched/core.c:5204)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? __switch_to (arch/x86/include/asm/bitops.h:55
include/asm-generic/bitops/instrumented-atomic.h:29
include/linux/thread_info.h:89 include/linux/sched.h:1978
arch/x86/include/asm/fpu/sched.h:68 arch/x86/kernel/process_64.c:674)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? __schedule (kernel/sched/core.c:6592)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? __pfx_migrate_enable (kernel/sched/core.c:2338)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? napi_pp_put_page (arch/x86/include/asm/atomic64_64.h:79
(discriminator 5) include/linux/atomic/atomic-arch-fallback.h:2913
(discriminator 5) include/linux/atomic/atomic-long.h:331
(discriminator 5) include/linux/atomic/atomic-instrumented.h:3446
(discriminator 5) include/net/page_pool/helpers.h:276 (discriminator
5) include/net/page_pool/helpers.h:308 (discriminator 5)
include/net/page_pool/helpers.h:320 (discriminator 5)
include/net/page_pool/helpers.h:353 (discriminator 5)
net/core/skbuff.c:1040 (discriminator 5))
__napi_poll (net/core/dev.c:6837)
bpf_trampoline_6442548359+0x79/0x123
? __cond_resched (arch/x86/include/asm/preempt.h:84 (discriminator 13)
kernel/sched/core.c:6891 (discriminator 13) kernel/sched/core.c:7234
(discriminator 13))
__napi_poll (net/core/dev.c:6824)
napi_threaded_poll_loop (include/linux/netpoll.h:90 net/core/dev.c:6958)
? __pfx_napi_threaded_poll_loop (net/core/dev.c:6941)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? sysvec_call_function_single (arch/x86/include/asm/hardirq.h:78
(discriminator 2) arch/x86/kernel/smp.c:266 (discriminator 2))
? napi_threaded_poll (arch/x86/include/asm/bitops.h:206
arch/x86/include/asm/bitops.h:238
include/asm-generic/bitops/instrumented-non-atomic.h:142
net/core/dev.c:6926 net/core/dev.c:6983)
napi_threaded_poll (net/core/dev.c:6984)
? __pfx_napi_threaded_poll (net/core/dev.c:6980)
kthread (kernel/kthread.c:389)
? recalc_sigpending (arch/x86/include/asm/bitops.h:75
include/asm-generic/bitops/instrumented-atomic.h:42
include/linux/thread_info.h:94 kernel/signal.c:178)
? __pfx_kthread (kernel/kthread.c:342)
ret_from_fork (arch/x86/kernel/process.c:152)
? __pfx_kthread (kernel/kthread.c:342)
ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
 </TASK>

nf_tproxy_ipv4
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x38c9eee
flags: 0x1effff800000000(node=7|zone=2|lastcpupid=0x1ffff)
xt_socket
raw: 01effff800000000 ffffea00e3075c48 ffffea00e3211648 0000000000000000
raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

nf_socket_ipv4
Memory state around the buggy address:
 ffff88b8c9eedf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
nf_socket_ipv6
 ffff88b8c9eedf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88b8c9eee000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                               ^
xt_NFQUEUE
 ffff88b8c9eee080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88b8c9eee100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
==================================================================
overlay
BUG: KASAN: use-after-free in veth_xdp_rcv_one+0xb0c/0xce0 [veth]
Read of size 8 at addr ffff88b8c9eee000 by task napi/iconduit-g/681556
esp4

CPU: 34 UID: 0 PID: 681556 Comm: napi/iconduit-g Kdump: loaded
Tainted: G    B   W  O       6.12.30-cloudflare-kasan-2025.5.26 #1
Tainted: [B]=BAD_PAGE, [W]=WARN, [O]=OOT_MODULE
Hardware name: Lenovo HR355M-V3-G12/HR355M_V3_HPM, BIOS
HR355M_V3.G.031 02/17/2025
Call Trace:
 <TASK>
dump_stack_lvl (lib/dump_stack.c:122)
print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
? __pfx__raw_spin_lock_irqsave (kernel/locking/spinlock.c:161)
? __pfx__raw_spin_lock (kernel/locking/spinlock.c:153)
? veth_xdp_rcv_one (include/net/xdp.h:254 drivers/net/veth.c:650) veth
kasan_report (mm/kasan/report.c:220 mm/kasan/report.c:603)
? veth_xdp_rcv_one (include/net/xdp.h:254 drivers/net/veth.c:650) veth
veth_xdp_rcv_one (include/net/xdp.h:254 drivers/net/veth.c:650) veth
? veth_xdp_rcv.constprop.0 (include/net/xdp.h:182
include/net/xdp.h:325 drivers/net/veth.c:924) veth
? __pfx_veth_xdp_rcv_one (drivers/net/veth.c:639) veth
? _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:152
(discriminator 2) kernel/locking/spinlock.c:194 (discriminator 2))
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? add_taint (arch/x86/include/asm/bitops.h:60
include/asm-generic/bitops/instrumented-atomic.h:29
kernel/panic.c:605)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? end_report.part.0 (mm/kasan/report.c:242)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? veth_xdp_rcv.constprop.0 (include/net/xdp.h:182
include/net/xdp.h:325 drivers/net/veth.c:924) veth
veth_xdp_rcv.constprop.0 (drivers/net/veth.c:926) veth
? napi_threaded_poll_loop (net/core/dev.c:6377 net/core/dev.c:6363
net/core/dev.c:6967)
? __pfx_veth_xdp_rcv.constprop.0 (drivers/net/veth.c:899) veth
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
veth_poll (drivers/net/veth.c:981) veth
? update_load_avg (kernel/sched/fair.c:4531 kernel/sched/fair.c:4868)
? __pfx_veth_poll (drivers/net/veth.c:969) veth
? __pfx___perf_event_task_sched_out (kernel/events/core.c:3765)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? finish_task_switch.isra.0 (arch/x86/include/asm/irqflags.h:42
arch/x86/include/asm/irqflags.h:119 kernel/sched/sched.h:1527
kernel/sched/core.c:5086 kernel/sched/core.c:5204)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? __switch_to (arch/x86/include/asm/bitops.h:55
include/asm-generic/bitops/instrumented-atomic.h:29
include/linux/thread_info.h:89 include/linux/sched.h:1978
arch/x86/include/asm/fpu/sched.h:68 arch/x86/kernel/process_64.c:674)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? __schedule (kernel/sched/core.c:6592)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? __pfx_migrate_enable (kernel/sched/core.c:2338)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? napi_pp_put_page (arch/x86/include/asm/atomic64_64.h:79
(discriminator 5) include/linux/atomic/atomic-arch-fallback.h:2913
(discriminator 5) include/linux/atomic/atomic-long.h:331
(discriminator 5) include/linux/atomic/atomic-instrumented.h:3446
(discriminator 5) include/net/page_pool/helpers.h:276 (discriminator
5) include/net/page_pool/helpers.h:308 (discriminator 5)
include/net/page_pool/helpers.h:320 (discriminator 5)
include/net/page_pool/helpers.h:353 (discriminator 5)
net/core/skbuff.c:1040 (discriminator 5))
__napi_poll (net/core/dev.c:6837)
bpf_trampoline_6442548359+0x79/0x123
? __cond_resched (arch/x86/include/asm/preempt.h:84 (discriminator 13)
kernel/sched/core.c:6891 (discriminator 13) kernel/sched/core.c:7234
(discriminator 13))
__napi_poll (net/core/dev.c:6824)
napi_threaded_poll_loop (include/linux/netpoll.h:90 net/core/dev.c:6958)
? __pfx_napi_threaded_poll_loop (net/core/dev.c:6941)
? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
? sysvec_call_function_single (arch/x86/include/asm/hardirq.h:78
(discriminator 2) arch/x86/kernel/smp.c:266 (discriminator 2))
? napi_threaded_poll (arch/x86/include/asm/bitops.h:206
arch/x86/include/asm/bitops.h:238
include/asm-generic/bitops/instrumented-non-atomic.h:142
net/core/dev.c:6926 net/core/dev.c:6983)
napi_threaded_poll (net/core/dev.c:6984)
? __pfx_napi_threaded_poll (net/core/dev.c:6980)
kthread (kernel/kthread.c:389)
? recalc_sigpending (arch/x86/include/asm/bitops.h:75
include/asm-generic/bitops/instrumented-atomic.h:42
include/linux/thread_info.h:94 kernel/signal.c:178)
? __pfx_kthread (kernel/kthread.c:342)
ret_from_fork (arch/x86/kernel/process.c:152)
? __pfx_kthread (kernel/kthread.c:342)
ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
 </TASK>

xt_hashlimit
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x38c9eee
flags: 0x1effff800000000(node=7|zone=2|lastcpupid=0x1ffff)
ip_set_hash_netport
raw: 01effff800000000 ffffea00e3075c48 ffffea00e3211648 0000000000000000
raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
xt_length

Memory state around the buggy address:
 ffff88b8c9eedf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88b8c9eedf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88b8c9eee000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
nft_compat
                   ^
 ffff88b8c9eee080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88b8c9eee100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
nf_conntrack_netlink
==================================================================
xfrm_interface
BUG: KASAN: use-after-free in veth_xdp_rcv_one+0x995/0xce0 [veth]
Read of size 2 at addr ffff88b8c9eee00a by task napi/iconduit-g/681556
xfrm6_tunnel

CPU: 34 UID: 0 PID: 681556 Comm: napi/iconduit-g Kdump: loaded
Tainted: G    B   W  O       6.12.30-cloudflare-kasan-2025.5.26 #1
Tainted: [B]=BAD_PAGE, [W]=WARN, [O]=OOT_MODULE
Hardware name: Lenovo HR355M-V3-G12/HR355M_V3_HPM, BIOS
HR355M_V3.G.031 02/17/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x4b/0x70
 print_report+0x14d/0x4cf
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 ? veth_xdp_rcv_one+0x995/0xce0 [veth]
 kasan_report+0xb6/0x140


* Re: [BUG] mlx5_core memory management issue
  2025-07-03 15:49 [BUG] mlx5_core memory management issue Chris Arges
@ 2025-07-04 12:37 ` Dragos Tatulea
  2025-07-04 20:14   ` Dragos Tatulea
  2025-07-23 18:48   ` Chris Arges
  0 siblings, 2 replies; 22+ messages in thread
From: Dragos Tatulea @ 2025-07-04 12:37 UTC (permalink / raw)
  To: Chris Arges, netdev, bpf
  Cc: kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

On Thu, Jul 03, 2025 at 10:49:20AM -0500, Chris Arges wrote:
> When running iperf through a set of XDP programs we were able to crash
> machines with NICs using the mlx5_core driver. We were able to confirm
> that other NICs/drivers did not exhibit the same problem, and suspect
> this could be a memory management issue in the driver code.
> Specifically we found a WARNING at include/net/page_pool/helpers.h:277
> mlx5e_page_release_fragmented.isra. We are able to demonstrate this
> issue in production using hardware, but cannot easily bisect because
> we don’t have a simple reproducer.
>
Thanks for the report! We will investigate.

> I wanted to share stack traces in
> order to help us further debug and understand if anyone else has run
> into this issue. We are currently working on getting more crashdumps
> and doing further analysis.
> 
> 
> The test setup looks like the following:
>   ┌─────┐
>   │mlx5 │
>   │NIC  │
>   └──┬──┘
>      │xdp ebpf program (does encap and XDP_TX)
>      │
>      ▼
>   ┌──────────────────────┐
>   │xdp.frags             │
>   │                      │
>   └──┬───────────────────┘
>      │tailcall
>      │BPF_REDIRECT_MAP (using CPUMAP bpf type)
>      ▼
>   ┌──────────────────────┐
>   │xdp.frags/cpumap      │
>   │                      │
>   └──┬───────────────────┘
>      │BPF_REDIRECT to veth (*potential trigger for issue)
>      │
>      ▼
>   ┌──────┐
>   │veth  │
>   │      │
>   └──┬───┘
>      │
>      │
>      ▼
> 
> Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
> BPF_REDIRECT_MAP into an xdp.frags/cpumap. For our reproducer we can
> choose a random valid CPU to reproduce the issue. Once that packet
> reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
> to a veth device which has an XDP program which redirects to an
> XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
> veth device that we noticed this issue.
> 
Would it be possible to try to use a single program that redirects to
the XSKMAP and check that the issue reproduces?

> When running with 6.12.30 to 6.12.32 kernels we are able to see the
> following KASAN use-after-free WARNINGs followed by a page fault which
> crashes the machine. We have not been able to test earlier or later
> kernels. I’ve tried to map symbols to lines of code for clarity.
>
Thanks for the KASAN reports, they are very useful. Keep us posted
if you have other updates. A first quick look didn't reveal anything
obvious from our side but we will keep looking.

Thanks,
Dragos


* Re: [BUG] mlx5_core memory management issue
  2025-07-04 12:37 ` Dragos Tatulea
@ 2025-07-04 20:14   ` Dragos Tatulea
  2025-07-07 22:07     ` Chris Arges
  2025-07-23 18:48   ` Chris Arges
  1 sibling, 1 reply; 22+ messages in thread
From: Dragos Tatulea @ 2025-07-04 20:14 UTC (permalink / raw)
  To: Chris Arges, netdev, bpf
  Cc: kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

On Fri, Jul 04, 2025 at 12:37:36PM +0000, Dragos Tatulea wrote:
> On Thu, Jul 03, 2025 at 10:49:20AM -0500, Chris Arges wrote:
> > When running iperf through a set of XDP programs we were able to crash
> > machines with NICs using the mlx5_core driver. We were able to confirm
> > that other NICs/drivers did not exhibit the same problem, and suspect
> > this could be a memory management issue in the driver code.
> > Specifically we found a WARNING at include/net/page_pool/helpers.h:277
> > mlx5e_page_release_fragmented.isra. We are able to demonstrate this
> > issue in production using hardware, but cannot easily bisect because
> > we don’t have a simple reproducer.
> >
> Thanks for the report! We will investigate.
> 
> > I wanted to share stack traces in
> > order to help us further debug and understand if anyone else has run
> > into this issue. We are currently working on getting more crashdumps
> > and doing further analysis.
> > 
> > 
> > The test setup looks like the following:
> >   ┌─────┐
> >   │mlx5 │
> >   │NIC  │
> >   └──┬──┘
> >      │xdp ebpf program (does encap and XDP_TX)
> >      │
> >      ▼
> >   ┌──────────────────────┐
> >   │xdp.frags             │
> >   │                      │
> >   └──┬───────────────────┘
> >      │tailcall
> >      │BPF_REDIRECT_MAP (using CPUMAP bpf type)
> >      ▼
> >   ┌──────────────────────┐
> >   │xdp.frags/cpumap      │
> >   │                      │
> >   └──┬───────────────────┘
> >      │BPF_REDIRECT to veth (*potential trigger for issue)
> >      │
> >      ▼
> >   ┌──────┐
> >   │veth  │
> >   │      │
> >   └──┬───┘
> >      │
> >      │
> >      ▼
> > 
> > Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
> > BPF_REDIRECT_MAP into an xdp.frags/cpumap. For our reproducer we can
> > choose a random valid CPU to reproduce the issue. Once that packet
> > reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
> > to a veth device which has an XDP program which redirects to an
> > XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
> > veth device that we noticed this issue.
> > 
> Would it be possible to try to use a single program that redirects to
> the XSKMAP and check that the issue reproduces?
>
I forgot to ask: what is the MTU size?
Also, are you setting any other special config on the device?
 
Thanks,
Dragos


* Re: [BUG] mlx5_core memory management issue
  2025-07-04 20:14   ` Dragos Tatulea
@ 2025-07-07 22:07     ` Chris Arges
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Arges @ 2025-07-07 22:07 UTC (permalink / raw)
  To: Dragos Tatulea, netdev, bpf
  Cc: netdev, bpf, kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

On Fri, Jul 04, 2025 at 08:14:20PM +0000, Dragos Tatulea wrote:
> On Fri, Jul 04, 2025 at 12:37:36PM +0000, Dragos Tatulea wrote:
> > On Thu, Jul 03, 2025 at 10:49:20AM -0500, Chris Arges wrote:
> > > When running iperf through a set of XDP programs we were able to crash
> > > machines with NICs using the mlx5_core driver. We were able to confirm
> > > that other NICs/drivers did not exhibit the same problem, and suspect
> > > this could be a memory management issue in the driver code.
> > > Specifically we found a WARNING at include/net/page_pool/helpers.h:277
> > > mlx5e_page_release_fragmented.isra. We are able to demonstrate this
> > > issue in production using hardware, but cannot easily bisect because
> > > we don’t have a simple reproducer.
> > >
> > Thanks for the report! We will investigate.
> > 
> > > I wanted to share stack traces in
> > > order to help us further debug and understand if anyone else has run
> > > into this issue. We are currently working on getting more crashdumps
> > > and doing further analysis.
> > > 
> > > 
> > > The test setup looks like the following:
> > >   ┌─────┐
> > >   │mlx5 │
> > >   │NIC  │
> > >   └──┬──┘
> > >      │xdp ebpf program (does encap and XDP_TX)
> > >      │
> > >      ▼
> > >   ┌──────────────────────┐
> > >   │xdp.frags             │
> > >   │                      │
> > >   └──┬───────────────────┘
> > >      │tailcall
> > >      │BPF_REDIRECT_MAP (using CPUMAP bpf type)
> > >      ▼
> > >   ┌──────────────────────┐
> > >   │xdp.frags/cpumap      │
> > >   │                      │
> > >   └──┬───────────────────┘
> > >      │BPF_REDIRECT to veth (*potential trigger for issue)
> > >      │
> > >      ▼
> > >   ┌──────┐
> > >   │veth  │
> > >   │      │
> > >   └──┬───┘
> > >      │
> > >      │
> > >      ▼
> > > 
> > > Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
> > > BPF_REDIRECT_MAP into an xdp.frags/cpumap. For our reproducer we can
> > > choose a random valid CPU to reproduce the issue. Once that packet
> > > reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
> > > to a veth device which has an XDP program which redirects to an
> > > XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
> > > veth device that we noticed this issue.
> > > 
> > Would it be possible to try to use a single program that redirects to
> > the XSKMAP and check that the issue reproduces?
> >
> I forgot to ask: what is the MTU size?
> Also, are you setting any other special config on the device?
>  
> Thanks,
> Dragos

Dragos,

The device has the following settings:
2: ext0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1600 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:48:7f:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 9978 addrgenmode eui64 numtxqueues 520 numrxqueues 65 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536 gso_ipv4_max_size 65536 gro_ipv4_max_size 65536 portname p0 switchid e87f480003da341c parentbus pci parentdev 0000:c1:00.0
    prog/xdp id 173

To help narrow down the problem, we tested the following packet paths:

1) Fails: XDP (mlx5 nic) -> CPU MAP -> DEV MAP (to veth) -> XSK
2) Works: XDP (mlx5 nic) -> CPU MAP -> Linux routing (to veth) -> XSK
3) Works: XDP (mlx5 nic) -> Linux routing (to veth) -> XSK

Given those cases, I would think a single program that redirects just to XSKMAP
would also work fine.

Thanks,
--chris


* Re: [BUG] mlx5_core memory management issue
  2025-07-04 12:37 ` Dragos Tatulea
  2025-07-04 20:14   ` Dragos Tatulea
@ 2025-07-23 18:48   ` Chris Arges
  2025-07-24 17:01     ` Dragos Tatulea
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Arges @ 2025-07-23 18:48 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: netdev, bpf, kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

On 2025-07-04 12:37:36, Dragos Tatulea wrote:
> On Thu, Jul 03, 2025 at 10:49:20AM -0500, Chris Arges wrote:
> > When running iperf through a set of XDP programs we were able to crash
> > machines with NICs using the mlx5_core driver. We were able to confirm
> > that other NICs/drivers did not exhibit the same problem, and suspect
> > this could be a memory management issue in the driver code.
> > Specifically we found a WARNING at include/net/page_pool/helpers.h:277
> > mlx5e_page_release_fragmented.isra. We are able to demonstrate this
> > issue in production using hardware, but cannot easily bisect because
> > we don’t have a simple reproducer.
> >
> Thanks for the report! We will investigate.
> 
> > I wanted to share stack traces in
> > order to help us further debug and understand if anyone else has run
> > into this issue. We are currently working on getting more crashdumps
> > and doing further analysis.
> > 
> > 
> > The test setup looks like the following:
> >   ┌─────┐
> >   │mlx5 │
> >   │NIC  │
> >   └──┬──┘
> >      │xdp ebpf program (does encap and XDP_TX)
> >      │
> >      ▼
> >   ┌──────────────────────┐
> >   │xdp.frags             │
> >   │                      │
> >   └──┬───────────────────┘
> >      │tailcall
> >      │BPF_REDIRECT_MAP (using CPUMAP bpf type)
> >      ▼
> >   ┌──────────────────────┐
> >   │xdp.frags/cpumap      │
> >   │                      │
> >   └──┬───────────────────┘
> >      │BPF_REDIRECT to veth (*potential trigger for issue)
> >      │
> >      ▼
> >   ┌──────┐
> >   │veth  │
> >   │      │
> >   └──┬───┘
> >      │
> >      │
> >      ▼
> > 
> > Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
> > BPF_REDIRECT_MAP into an xdp.frags/cpumap. For our reproducer we can
> > choose a random valid CPU to reproduce the issue. Once that packet
> > reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
> > to a veth device which has an XDP program which redirects to an
> > XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
> > veth device that we noticed this issue.
> > 
> Would it be possible to try to use a single program that redirects to
> the XSKMAP and check that the issue reproduces?
> 
> > When running with 6.12.30 to 6.12.32 kernels we are able to see the
> > following KASAN use-after-free WARNINGs followed by a page fault which
> > crashes the machine. We have not been able to test earlier or later
> > kernels. I’ve tried to map symbols to lines of code for clarity.
> >
> Thanks for the KASAN reports, they are very useful. Keep us posted
> if you have other updates. A first quick look didn't reveal anything
> obvious from our side but we will keep looking.
> 
> Thanks,
> Dragos

Ok, we can reproduce this problem!

I tried to simplify this reproducer, but it seems like what's needed is:
- xdp program attached to mlx5 NIC
- cpumap redirect
- device redirect (map or just bpf_redirect)
- frame gets turned into an skb
Then from another machine send many flows of UDP traffic to trigger the problem.

I've put together a program that reproduces the issue here:
- https://github.com/arges/xdp-redirector
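
The traffic side is nothing special; the repo above contains the actual
setup. Just to illustrate the "many flows" pattern, something like the
following trivial UDP sender (placeholder destination address and port) is
enough:

```
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port = htons(7777),	/* placeholder port */
	};
	char payload[1400];

	inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);	/* placeholder target */
	memset(payload, 0xab, sizeof(payload));

	for (;;) {
		/* A fresh socket per burst gets a new source port, i.e. a new flow. */
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		if (fd < 0)
			return 1;
		for (int i = 0; i < 1000; i++)
			sendto(fd, payload, sizeof(payload), 0,
			       (struct sockaddr *)&dst, sizeof(dst));
		close(fd);
	}
	return 0;
}
```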

In general the failure manifests with many different WARNs such as:
include/net/page_pool/helpers.h:277 mlx5e_page_release_fragmented.isra.0+0xf7/0x150 [mlx5_core]
Then the machine crashes.

I was able to get a crashdump which shows:
```
PID: 0        TASK: ffff8c0910134380  CPU: 76   COMMAND: "swapper/76"
 #0 [fffffe10906d3ea8] crash_nmi_callback at ffffffffadc5c4fd
 #1 [fffffe10906d3eb0] default_do_nmi at ffffffffae9524f0
 #2 [fffffe10906d3ed0] exc_nmi at ffffffffae952733
 #3 [fffffe10906d3ef0] end_repeat_nmi at ffffffffaea01bfd
    [exception RIP: io_serial_in+25]
    RIP: ffffffffae4cd489  RSP: ffffb3c60d6049e8  RFLAGS: 00000002
    RAX: ffffffffae4cd400  RBX: 00000000000025d8  RCX: 0000000000000000
    RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffffb10a9cb0
    RBP: 0000000000000000   R8: 2d2d2d2d2d2d2d2d   R9: 656820747563205b
    R10: 000000002d2d2d2d  R11: 000000002d2d2d2d  R12: ffffffffb0fa5610
    R13: 0000000000000000  R14: 0000000000000000  R15: ffffffffb10a9cb0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffffb3c60d6049e8] io_serial_in at ffffffffae4cd489
 #5 [ffffb3c60d6049e8] serial8250_console_write at ffffffffae4d2fcf
 #6 [ffffb3c60d604a80] console_flush_all at ffffffffadd1cf26
 #7 [ffffb3c60d604b00] console_unlock at ffffffffadd1d1df
 #8 [ffffb3c60d604b48] vprintk_emit at ffffffffadd1dda1
 #9 [ffffb3c60d604b98] _printk at ffffffffae90250c
#10 [ffffb3c60d604bf8] report_bug.cold at ffffffffae95001d
#11 [ffffb3c60d604c38] handle_bug at ffffffffae950e91
#12 [ffffb3c60d604c58] exc_invalid_op at ffffffffae9512b7
#13 [ffffb3c60d604c70] asm_exc_invalid_op at ffffffffaea0123a
    [exception RIP: mlx5e_page_release_fragmented+85]
    RIP: ffffffffc25f75c5  RSP: ffffb3c60d604d20  RFLAGS: 00010293
    RAX: 000000000000003f  RBX: ffff8bfa8f059fd0  RCX: ffffe3bf1992a180
    RDX: 000000000000003d  RSI: ffffe3bf1992a180  RDI: ffff8bf9b0784000
    RBP: 0000000000000040   R8: 00000000000001d2   R9: 0000000000000006
    R10: ffff8c06de22f380  R11: ffff8bfcfe6cd680  R12: 00000000000001d2
    R13: 000000000000002b  R14: ffff8bf9b0784000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#14 [ffffb3c60d604d20] mlx5e_free_rx_wqes at ffffffffc25f7e2f [mlx5_core]
#15 [ffffb3c60d604d58] mlx5e_post_rx_wqes at ffffffffc25f877c [mlx5_core]
#16 [ffffb3c60d604dc0] mlx5e_napi_poll at ffffffffc25fdd27 [mlx5_core]
#17 [ffffb3c60d604e20] __napi_poll at ffffffffae6a8ddb
#18 [ffffb3c60d604e90] __napi_poll at ffffffffae6a8db5
#19 [ffffb3c60d604e98] net_rx_action at ffffffffae6a95f1
#20 [ffffb3c60d604f98] handle_softirqs at ffffffffadc9d4bf
#21 [ffffb3c60d604fe8] irq_exit_rcu at ffffffffadc9e057
#22 [ffffb3c60d604ff0] common_interrupt at ffffffffae952015
--- <IRQ stack> ---
#23 [ffffb3c60c837de8] asm_common_interrupt at ffffffffaea01466
    [exception RIP: cpuidle_enter_state+184]
    RIP: ffffffffae955c38  RSP: ffffb3c60c837e98  RFLAGS: 00000202
    RAX: ffff8c0cffc00000  RBX: ffff8c0911002400  RCX: 0000000000000000
    RDX: 00003c630b2d073a  RSI: ffffffe519600d10  RDI: 0000000000000000
    RBP: 0000000000000001   R8: 0000000000000002   R9: 0000000000000001
    R10: ffff8c0cffc330c4  R11: 071c71c71c71c71c  R12: ffffffffb05ff820
    R13: 00003c630b2d073a  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#24 [ffffb3c60c837ed0] cpuidle_enter at ffffffffae64b4ad
#25 [ffffb3c60c837ef0] do_idle at ffffffffadcfa7c6
#26 [ffffb3c60c837f30] cpu_startup_entry at ffffffffadcfaa09
#27 [ffffb3c60c837f40] start_secondary at ffffffffadc5ec77
#28 [ffffb3c60c837f50] common_startup_64 at ffffffffadc24d5d
```

Assuming (this is x86_64):
RDI=ffff8bf9b0784000 (rq)
RSI=ffffe3bf1992a180 (frag_page)

```
static void mlx5e_page_release_fragmented(struct mlx5e_rq *rq,
                                          struct mlx5e_frag_page *frag_page)
{
        u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
        struct page *page = frag_page->page;

        if (page_pool_unref_page(page, drain_count) == 0)
                page_pool_put_unrefed_page(rq->page_pool, page, -1, true);
}
```

crash> struct mlx5e_frag_page ffffe3bf1992a180
struct mlx5e_frag_page {
  page = 0x26ffff800000000,
  frags = 49856
}

This means that drain_count could be an unexpected number (assuming that
frag_page->frags is expected to stay below MLX5E_PAGECNT_BIAS_MAX).
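
As a standalone illustration of that wraparound (userspace code, not the
driver; the bias value below is an arbitrary placeholder since the real
MLX5E_PAGECNT_BIAS_MAX is not shown here, only the frags value comes from the
crashdump):

```
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint16_t bias_max = 16384;	/* placeholder, NOT the real constant */
	uint16_t frags = 49856;		/* frag_page->frags from the crashdump */
	/* u16 subtraction wraps instead of going negative... */
	uint16_t drain_count = bias_max - frags;

	/* ...so page_pool_unref_page() gets asked to drop a bogus count. */
	printf("drain_count = %d\n", drain_count);	/* prints 32064 */
	return 0;
}
```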

Let me know what additional experiments would be useful here.

--chris


* Re: [BUG] mlx5_core memory management issue
  2025-07-23 18:48   ` Chris Arges
@ 2025-07-24 17:01     ` Dragos Tatulea
  2025-08-07 16:45       ` Chris Arges
  0 siblings, 1 reply; 22+ messages in thread
From: Dragos Tatulea @ 2025-07-24 17:01 UTC (permalink / raw)
  To: Chris Arges
  Cc: netdev, bpf, kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

On Wed, Jul 23, 2025 at 01:48:07PM -0500, Chris Arges wrote:
> 
> Ok, we can reproduce this problem!
> 
> I tried to simplify this reproducer, but it seems like what's needed is:
> - xdp program attached to mlx5 NIC
> - cpumap redirect
> - device redirect (map or just bpf_redirect)
> - frame gets turned into an skb
> Then from another machine send many flows of UDP traffic to trigger the problem.
> 
> I've put together a program that reproduces the issue here:
> - https://github.com/arges/xdp-redirector
>
Much appreciated! I fumbled around initially, not managing to get
traffic to the xdp_devmap stage. But further debugging revealed that GRO
needs to be enabled on the veth devices for XDP redir to work to the
xdp_devmap. After that I managed to reproduce your issue.

Now I can start looking into it.

> In general the failure manifests with many different WARNs such as:
> include/net/page_pool/helpers.h:277 mlx5e_page_release_fragmented.isra.0+0xf7/0x150 [mlx5_core]
> Then the machine crashes.
> 
> I was able to get a crashdump which shows:
> ```
> PID: 0        TASK: ffff8c0910134380  CPU: 76   COMMAND: "swapper/76"
>  #0 [fffffe10906d3ea8] crash_nmi_callback at ffffffffadc5c4fd
>  #1 [fffffe10906d3eb0] default_do_nmi at ffffffffae9524f0
>  #2 [fffffe10906d3ed0] exc_nmi at ffffffffae952733
>  #3 [fffffe10906d3ef0] end_repeat_nmi at ffffffffaea01bfd
>     [exception RIP: io_serial_in+25]
>     RIP: ffffffffae4cd489  RSP: ffffb3c60d6049e8  RFLAGS: 00000002
>     RAX: ffffffffae4cd400  RBX: 00000000000025d8  RCX: 0000000000000000
>     RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffffb10a9cb0
>     RBP: 0000000000000000   R8: 2d2d2d2d2d2d2d2d   R9: 656820747563205b
>     R10: 000000002d2d2d2d  R11: 000000002d2d2d2d  R12: ffffffffb0fa5610
>     R13: 0000000000000000  R14: 0000000000000000  R15: ffffffffb10a9cb0
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [ffffb3c60d6049e8] io_serial_in at ffffffffae4cd489
>  #5 [ffffb3c60d6049e8] serial8250_console_write at ffffffffae4d2fcf
>  #6 [ffffb3c60d604a80] console_flush_all at ffffffffadd1cf26
>  #7 [ffffb3c60d604b00] console_unlock at ffffffffadd1d1df
>  #8 [ffffb3c60d604b48] vprintk_emit at ffffffffadd1dda1
>  #9 [ffffb3c60d604b98] _printk at ffffffffae90250c
> #10 [ffffb3c60d604bf8] report_bug.cold at ffffffffae95001d
> #11 [ffffb3c60d604c38] handle_bug at ffffffffae950e91
> #12 [ffffb3c60d604c58] exc_invalid_op at ffffffffae9512b7
> #13 [ffffb3c60d604c70] asm_exc_invalid_op at ffffffffaea0123a
>     [exception RIP: mlx5e_page_release_fragmented+85]
>     RIP: ffffffffc25f75c5  RSP: ffffb3c60d604d20  RFLAGS: 00010293
>     RAX: 000000000000003f  RBX: ffff8bfa8f059fd0  RCX: ffffe3bf1992a180
>     RDX: 000000000000003d  RSI: ffffe3bf1992a180  RDI: ffff8bf9b0784000
>     RBP: 0000000000000040   R8: 00000000000001d2   R9: 0000000000000006
>     R10: ffff8c06de22f380  R11: ffff8bfcfe6cd680  R12: 00000000000001d2
>     R13: 000000000000002b  R14: ffff8bf9b0784000  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> #14 [ffffb3c60d604d20] mlx5e_free_rx_wqes at ffffffffc25f7e2f [mlx5_core]
> #15 [ffffb3c60d604d58] mlx5e_post_rx_wqes at ffffffffc25f877c [mlx5_core]
> #16 [ffffb3c60d604dc0] mlx5e_napi_poll at ffffffffc25fdd27 [mlx5_core]
> #17 [ffffb3c60d604e20] __napi_poll at ffffffffae6a8ddb
> #18 [ffffb3c60d604e90] __napi_poll at ffffffffae6a8db5
> #19 [ffffb3c60d604e98] net_rx_action at ffffffffae6a95f1
> #20 [ffffb3c60d604f98] handle_softirqs at ffffffffadc9d4bf
> #21 [ffffb3c60d604fe8] irq_exit_rcu at ffffffffadc9e057
> #22 [ffffb3c60d604ff0] common_interrupt at ffffffffae952015
> --- <IRQ stack> ---
> #23 [ffffb3c60c837de8] asm_common_interrupt at ffffffffaea01466
>     [exception RIP: cpuidle_enter_state+184]
>     RIP: ffffffffae955c38  RSP: ffffb3c60c837e98  RFLAGS: 00000202
>     RAX: ffff8c0cffc00000  RBX: ffff8c0911002400  RCX: 0000000000000000
>     RDX: 00003c630b2d073a  RSI: ffffffe519600d10  RDI: 0000000000000000
>     RBP: 0000000000000001   R8: 0000000000000002   R9: 0000000000000001
>     R10: ffff8c0cffc330c4  R11: 071c71c71c71c71c  R12: ffffffffb05ff820
>     R13: 00003c630b2d073a  R14: 0000000000000001  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> #24 [ffffb3c60c837ed0] cpuidle_enter at ffffffffae64b4ad
> #25 [ffffb3c60c837ef0] do_idle at ffffffffadcfa7c6
> #26 [ffffb3c60c837f30] cpu_startup_entry at ffffffffadcfaa09
> #27 [ffffb3c60c837f40] start_secondary at ffffffffadc5ec77
> #28 [ffffb3c60c837f50] common_startup_64 at ffffffffadc24d5d
> ```
> 
> Assuming (this is x86_64):
> RDI=ffff8bf9b0784000 (rq)
> RSI=ffffe3bf1992a180 (frag_page)
> 
> ```
> static void mlx5e_page_release_fragmented(struct mlx5e_rq *rq,
>                                           struct mlx5e_frag_page *frag_page)
> {
>         u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
>         struct page *page = frag_page->page;
> 
>         if (page_pool_unref_page(page, drain_count) == 0)
>                 page_pool_put_unrefed_page(rq->page_pool, page, -1, true);
> }
> ```
> 
> crash> struct mlx5e_frag_page ffffe3bf1992a180
> struct mlx5e_frag_page {
>   page = 0x26ffff800000000,
>   frags = 49856
> }
>
Most incorrect fragment counting issues have a tendency to show up here.

Thanks,
Dragos


* Re: [BUG] mlx5_core memory management issue
  2025-07-24 17:01     ` Dragos Tatulea
@ 2025-08-07 16:45       ` Chris Arges
  2025-08-11  8:37         ` Dragos Tatulea
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Arges @ 2025-08-07 16:45 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: netdev, bpf, kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

On 2025-07-24 17:01:16, Dragos Tatulea wrote:
> On Wed, Jul 23, 2025 at 01:48:07PM -0500, Chris Arges wrote:
> > 
> > Ok, we can reproduce this problem!
> > 
> > I tried to simplify this reproducer, but it seems like what's needed is:
> > - xdp program attached to mlx5 NIC
> > - cpumap redirect
> > - device redirect (map or just bpf_redirect)
> > - frame gets turned into an skb
> > Then from another machine send many flows of UDP traffic to trigger the problem.
> > 
> > I've put together a program that reproduces the issue here:
> > - https://github.com/arges/xdp-redirector
> >
> Much appreciated! I fumbled around initially, not managing to get
> traffic to the xdp_devmap stage. But further debugging revealed that GRO
> needs to be enabled on the veth devices for XDP redir to work to the
> xdp_devmap. After that I managed to reproduce your issue.
> 
> Now I can start looking into it.
> 

Dragos,

There was a similar reference counting issue identified in:
https://lore.kernel.org/all/20250801170754.2439577-1-kuba@kernel.org/

Part of the commit message mentioned:
> Unfortunately for fbnic since commit f7dc3248dcfb ("skbuff: Optimization
> of SKB coalescing for page pool") core _may_ actually take two extra
> pp refcounts, if one of them is returned before driver gives up the bias
> the ret < 0 check in page_pool_unref_netmem() will trigger.

In order to help debug the mlx5 issue caused by xdp redirection, I built a
kernel with commit f7dc3248dcfb reverted, but unfortunately I was still able
to reproduce the issue.

I am happy to try some other experiments, or if there are other ideas you have.

Thanks,
--chris


* Re: [BUG] mlx5_core memory management issue
  2025-08-07 16:45       ` Chris Arges
@ 2025-08-11  8:37         ` Dragos Tatulea
  2025-08-12 15:44           ` Dragos Tatulea
  0 siblings, 1 reply; 22+ messages in thread
From: Dragos Tatulea @ 2025-08-11  8:37 UTC (permalink / raw)
  To: Chris Arges
  Cc: netdev, bpf, kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

Hi Chris,

Sorry for the late reply, I was on holiday.

On Thu, Aug 07, 2025 at 11:45:40AM -0500, Chris Arges wrote:
> On 2025-07-24 17:01:16, Dragos Tatulea wrote:
> > On Wed, Jul 23, 2025 at 01:48:07PM -0500, Chris Arges wrote:
> > > 
> > > Ok, we can reproduce this problem!
> > > 
> > > I tried to simplify this reproducer, but it seems like what's needed is:
> > > - xdp program attached to mlx5 NIC
> > > - cpumap redirect
> > > - device redirect (map or just bpf_redirect)
> > > - frame gets turned into an skb
> > > Then from another machine send many flows of UDP traffic to trigger the problem.
> > > 
> > > I've put together a program that reproduces the issue here:
> > > - https://github.com/arges/xdp-redirector
> > >
> > Much appreciated! I fumbled around initially, not managing to get
> > traffic to the xdp_devmap stage. But further debugging revealed that GRO
> > needs to be enabled on the veth devices for XDP redir to work to the
> > xdp_devmap. After that I managed to reproduce your issue.
> > 
> > Now I can start looking into it.
> > 
> 
> Dragos,
> 
> There was a similar reference counting issue identified in:
> https://lore.kernel.org/all/20250801170754.2439577-1-kuba@kernel.org/
> 
> Part of the commit message mentioned:
> > Unfortunately for fbnic since commit f7dc3248dcfb ("skbuff: Optimization
> > of SKB coalescing for page pool") core _may_ actually take two extra
> > pp refcounts, if one of them is returned before driver gives up the bias
> > the ret < 0 check in page_pool_unref_netmem() will trigger.
> 
> In order to help debug the mlx5 issue caused by xdp redirection, I built a
> kernel with commit f7dc3248dcfb reverted, but unfortunately I was still able
> to reproduce the issue.
Thanks for trying this.

> 
> I am happy to try some other experiments, or if there are other ideas you have.
>
I am actively debugging the issue but progress is slow as it is not an
easy one. So far I have been able to trace it back to the fact that the
page_pool is returning the same page twice on allocation without having a
release in between. As this is quite weird, I think I still have to
trace it back a few more steps to find the actual issue.

Thanks,
Dragos


* Re: [BUG] mlx5_core memory management issue
  2025-08-11  8:37         ` Dragos Tatulea
@ 2025-08-12 15:44           ` Dragos Tatulea
  2025-08-12 18:55             ` Jesse Brandeburg
  0 siblings, 1 reply; 22+ messages in thread
From: Dragos Tatulea @ 2025-08-12 15:44 UTC (permalink / raw)
  To: Chris Arges
  Cc: netdev, bpf, kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

Hi Chris,

On Mon, Aug 11, 2025 at 08:37:56AM +0000, Dragos Tatulea wrote:
> Hi Chris,
> 
> Sorry for the late reply, I was on holiday.
> 
> On Thu, Aug 07, 2025 at 11:45:40AM -0500, Chris Arges wrote:
> > On 2025-07-24 17:01:16, Dragos Tatulea wrote:
> > > On Wed, Jul 23, 2025 at 01:48:07PM -0500, Chris Arges wrote:
> > > > 
> > > > Ok, we can reproduce this problem!
> > > > 
> > > > I tried to simplify this reproducer, but it seems like what's needed is:
> > > > - xdp program attached to mlx5 NIC
> > > > - cpumap redirect
> > > > - device redirect (map or just bpf_redirect)
> > > > - frame gets turned into an skb
> > > > Then from another machine send many flows of UDP traffic to trigger the problem.
> > > > 
> > > > I've put together a program that reproduces the issue here:
> > > > - https://github.com/arges/xdp-redirector
> > > >
> > > Much appreciated! I fumbled around initially, not managing to get
> > > traffic to the xdp_devmap stage. But further debugging revealed that GRO
> > > needs to be enabled on the veth devices for XDP redir to work to the
> > > xdp_devmap. After that I managed to reproduce your issue.
> > > 
> > > Now I can start looking into it.
> > > 
> > 
> > Dragos,
> > 
> > There was a similar reference counting issue identified in:
> > https://lore.kernel.org/all/20250801170754.2439577-1-kuba@kernel.org/
> > 
> > Part of the commit message mentioned:
> > > Unfortunately for fbnic since commit f7dc3248dcfb ("skbuff: Optimization
> > > of SKB coalescing for page pool") core _may_ actually take two extra
> > > pp refcounts, if one of them is returned before driver gives up the bias
> > > the ret < 0 check in page_pool_unref_netmem() will trigger.
> > 
> > In order to help debug the mlx5 issue caused by xdp redirection, I built a
> > kernel with commit f7dc3248dcfb reverted, but unfortunately I was still able
> > to reproduce the issue.
> Thanks for trying this.
> 
> > 
> > I am happy to try some other experiments, or if there are other ideas you have.
> >
> I am actively debugging the issue but progress is slow as it is not an
> easy one. So far I have been able to trace it back to the fact that the
> page_pool is returning the same page twice on allocation without having a
> release in between. As this is quite weird, I think I still have to
> trace it back a few more steps to find the actual issue.
>
Ok, so I think I've found the issue: there's some place which recycles
pages to the page_pool cache directly while running from a different CPU
than it should.

This happens when dropping frames during the __dev_flush() of the device
map from the cpumap CPU. Here's the call graph:
-> cpu_map_bpf_prog_run()
  -> xdp_do_flush (on redirects)
    -> __dev_flush()
      -> bq_xmit_all()
        -> xdp_return_frame_rx_napi() (called on drop)
          -> page_pool_put_full_netmem(pp, page, true) (always set to true)

So normally xdp_do_flush() is called by the driver which happens from
the right NAPI context. But for cpumap + redirect it is called from the
cpumap CPU. So returning frames in this countext should be done with the
"no direct" flag set.

Could you try the below patch and check if you still get the crash? The
patch fixes specifically this flow, but I wonder if there are similar
places where this protection is missing.

Patch:

---
 kernel/bpf/devmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 482d284a1553..484216c7454d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
        /* If not all frames have been transmitted, it is our
         * responsibility to free them
         */
+       xdp_set_return_frame_no_direct();
        for (i = sent; unlikely(i < to_send); i++)
                xdp_return_frame_rx_napi(bq->q[i]);
+       xdp_clear_return_frame_no_direct();
 
 out:
        bq->count = 0;
-- 
2.50.1




* Re: [BUG] mlx5_core memory management issue
  2025-08-12 15:44           ` Dragos Tatulea
@ 2025-08-12 18:55             ` Jesse Brandeburg
  2025-08-12 20:19               ` Dragos Tatulea
  0 siblings, 1 reply; 22+ messages in thread
From: Jesse Brandeburg @ 2025-08-12 18:55 UTC (permalink / raw)
  To: Dragos Tatulea, Chris Arges
  Cc: netdev, bpf, kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai, hawk

On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:

> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 482d284a1553..484216c7454d 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>          /* If not all frames have been transmitted, it is our
>           * responsibility to free them
>           */
> +       xdp_set_return_frame_no_direct();
>          for (i = sent; unlikely(i < to_send); i++)
>                  xdp_return_frame_rx_napi(bq->q[i]);
> +       xdp_clear_return_frame_no_direct();

Why can't this instead just be xdp_return_frame(bq->q[i]); with no 
"no_direct" fussing?

Wouldn't this be the safest way for this function to call frame 
completion? It seems like presuming the calling context is napi is wrong?

The other option here seems to be using the xdp_return_frame_bulk() but 
you'd need to be careful to make sure the rcu lock was taken or already 
held, but it should already be, since it's taken inside xdp_do_flush.

>   
>   out:
>          bq->count = 0;
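
Roughly, that bulk-return shape would be something like this (untested
sketch using the generic xdp_frame_bulk helpers, taking the RCU read lock
locally rather than relying on the caller):

```
	struct xdp_frame_bulk frame_bulk;

	xdp_frame_bulk_init(&frame_bulk);
	rcu_read_lock();	/* xdp_return_frame_bulk() must run under RCU */
	for (i = sent; unlikely(i < to_send); i++)
		xdp_return_frame_bulk(bq->q[i], &frame_bulk);
	xdp_flush_frame_bulk(&frame_bulk);
	rcu_read_unlock();
```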





* Re: [BUG] mlx5_core memory management issue
  2025-08-12 18:55             ` Jesse Brandeburg
@ 2025-08-12 20:19               ` Dragos Tatulea
  2025-08-12 21:25                 ` Chris Arges
  0 siblings, 1 reply; 22+ messages in thread
From: Dragos Tatulea @ 2025-08-12 20:19 UTC (permalink / raw)
  To: Jesse Brandeburg, Chris Arges
  Cc: netdev, bpf, kernel-team, Jesper Dangaard Brouer, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai

On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
> On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
> 
> > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > index 482d284a1553..484216c7454d 100644
> > --- a/kernel/bpf/devmap.c
> > +++ b/kernel/bpf/devmap.c
> > @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >          /* If not all frames have been transmitted, it is our
> >           * responsibility to free them
> >           */
> > +       xdp_set_return_frame_no_direct();
> >          for (i = sent; unlikely(i < to_send); i++)
> >                  xdp_return_frame_rx_napi(bq->q[i]);
> > +       xdp_clear_return_frame_no_direct();
> 
> Why can't this instead just be xdp_return_frame(bq->q[i]); with no
> "no_direct" fussing?
> 
> Wouldn't this be the safest way for this function to call frame completion?
> It seems like presuming the calling context is napi is wrong?
>
It would be better indeed. Thanks for removing my horse glasses!

Once Chris verifies that this works for him I can prepare a fix patch.

> The other option here seems to be using the xdp_return_frame_bulk() but
> you'd need to be careful to make sure the rcu lock was taken or already
> held, but it should already be, since it's taken inside xdp_do_flush.
>
That would be even better, but bq_xmit_all() is also called by
bq_enqueue() which doesn't seem to have the rcu lock taken.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-12 20:19               ` Dragos Tatulea
@ 2025-08-12 21:25                 ` Chris Arges
  2025-08-13 18:53                   ` Chris Arges
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Arges @ 2025-08-12 21:25 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: Jesse Brandeburg, netdev, bpf, kernel-team,
	Jesper Dangaard Brouer, tariqt, saeedm, Leon Romanovsky,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Simon Horman, Andrew Rzeznik, Yan Zhai

On 2025-08-12 20:19:30, Dragos Tatulea wrote:
> On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
> > On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
> > 
> > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > > index 482d284a1553..484216c7454d 100644
> > > --- a/kernel/bpf/devmap.c
> > > +++ b/kernel/bpf/devmap.c
> > > @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > >          /* If not all frames have been transmitted, it is our
> > >           * responsibility to free them
> > >           */
> > > +       xdp_set_return_frame_no_direct();
> > >          for (i = sent; unlikely(i < to_send); i++)
> > >                  xdp_return_frame_rx_napi(bq->q[i]);
> > > +       xdp_clear_return_frame_no_direct();
> > 
> > Why can't this instead just be xdp_return_frame(bq->q[i]); with no
> > "no_direct" fussing?
> > 
> > Wouldn't this be the safest way for this function to call frame completion?
> > It seems like presuming the calling context is napi is wrong?
> >
> It would be better indeed. Thanks for removing my horse glasses!
> 
> Once Chris verifies that this works for him I can prepare a fix patch.
>
Working on that now, I'm testing a kernel with the following change:

---

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 3aa002a47..ef86d9e06 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
         * responsibility to free them
         */
        for (i = sent; unlikely(i < to_send); i++)
-               xdp_return_frame_rx_napi(bq->q[i]);
+               xdp_return_frame(bq->q[i]);
 
 out:
        bq->count = 0;

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-12 21:25                 ` Chris Arges
@ 2025-08-13 18:53                   ` Chris Arges
  2025-08-13 19:26                     ` Dragos Tatulea
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Arges @ 2025-08-13 18:53 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: Jesse Brandeburg, netdev, bpf, kernel-team,
	Jesper Dangaard Brouer, tariqt, saeedm, Leon Romanovsky,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Simon Horman, Andrew Rzeznik, Yan Zhai

On 2025-08-12 16:25:58, Chris Arges wrote:
> On 2025-08-12 20:19:30, Dragos Tatulea wrote:
> > On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
> > > On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
> > > 
> > > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > > > index 482d284a1553..484216c7454d 100644
> > > > --- a/kernel/bpf/devmap.c
> > > > +++ b/kernel/bpf/devmap.c
> > > > @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > > >          /* If not all frames have been transmitted, it is our
> > > >           * responsibility to free them
> > > >           */
> > > > +       xdp_set_return_frame_no_direct();
> > > >          for (i = sent; unlikely(i < to_send); i++)
> > > >                  xdp_return_frame_rx_napi(bq->q[i]);
> > > > +       xdp_clear_return_frame_no_direct();
> > > 
> > > Why can't this instead just be xdp_return_frame(bq->q[i]); with no
> > > "no_direct" fussing?
> > > 
> > > Wouldn't this be the safest way for this function to call frame completion?
> > > It seems like presuming the calling context is napi is wrong?
> > >
> > It would be better indeed. Thanks for removing my horse glasses!
> > 
> > Once Chris verifies that this works for him I can prepare a fix patch.
> >
> Working on that now, I'm testing a kernel with the following change:
> 
> ---
> 
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 3aa002a47..ef86d9e06 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>          * responsibility to free them
>          */
>         for (i = sent; unlikely(i < to_send); i++)
> -               xdp_return_frame_rx_napi(bq->q[i]);
> +               xdp_return_frame(bq->q[i]);
>  
>  out:
>         bq->count = 0;

This patch resolves the issue I was seeing and I am no longer able to
reproduce it. I tested for about 2 hours, whereas the reproducer usually
takes about 1-2 minutes.

--chris


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-13 18:53                   ` Chris Arges
@ 2025-08-13 19:26                     ` Dragos Tatulea
  2025-08-13 20:24                       ` Dragos Tatulea
  0 siblings, 1 reply; 22+ messages in thread
From: Dragos Tatulea @ 2025-08-13 19:26 UTC (permalink / raw)
  To: Chris Arges
  Cc: Jesse Brandeburg, netdev, bpf, kernel-team,
	Jesper Dangaard Brouer, tariqt, saeedm, Leon Romanovsky,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Simon Horman, Andrew Rzeznik, Yan Zhai

On Wed, Aug 13, 2025 at 01:53:48PM -0500, Chris Arges wrote:
> On 2025-08-12 16:25:58, Chris Arges wrote:
> > On 2025-08-12 20:19:30, Dragos Tatulea wrote:
> > > On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
> > > > On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
> > > > 
> > > > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > > > > index 482d284a1553..484216c7454d 100644
> > > > > --- a/kernel/bpf/devmap.c
> > > > > +++ b/kernel/bpf/devmap.c
> > > > > @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > > > >          /* If not all frames have been transmitted, it is our
> > > > >           * responsibility to free them
> > > > >           */
> > > > > +       xdp_set_return_frame_no_direct();
> > > > >          for (i = sent; unlikely(i < to_send); i++)
> > > > >                  xdp_return_frame_rx_napi(bq->q[i]);
> > > > > +       xdp_clear_return_frame_no_direct();
> > > > 
> > > > Why can't this instead just be xdp_return_frame(bq->q[i]); with no
> > > > "no_direct" fussing?
> > > > 
> > > > Wouldn't this be the safest way for this function to call frame completion?
> > > > It seems like presuming the calling context is napi is wrong?
> > > >
> > > It would be better indeed. Thanks for removing my horse glasses!
> > > 
> > > Once Chris verifies that this works for him I can prepare a fix patch.
> > >
> > Working on that now, I'm testing a kernel with the following change:
> > 
> > ---
> > 
> > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > index 3aa002a47..ef86d9e06 100644
> > --- a/kernel/bpf/devmap.c
> > +++ b/kernel/bpf/devmap.c
> > @@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >          * responsibility to free them
> >          */
> >         for (i = sent; unlikely(i < to_send); i++)
> > -               xdp_return_frame_rx_napi(bq->q[i]);
> > +               xdp_return_frame(bq->q[i]);
> >  
> >  out:
> >         bq->count = 0;
> 
> This patch resolves the issue I was seeing and I am no longer able to
> reproduce the issue. I tested for about 2 hours, when the reproducer usually
> takes about 1-2 minutes.
>
Thanks! Will send a patch tomorrow and also add you in the Tested-by tag.

As follow-up work it would be good to have a way to catch this family of
issues, something along the lines of the patch below.

Thanks,
Dragos

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f1373756cd0f..0c498fbd8df6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -794,6 +794,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 {
        lockdep_assert_no_hardirq();
 
+#ifdef CONFIG_PAGE_POOL_CACHEDEBUG
+       WARN(page_pool_napi_local(pool), "Page pool cache access from non-direct napi context");
+#endif
+
        /* This allocator is optimized for the XDP mode that uses
         * one-frame-per-page, but have fallbacks that act like the
         * regular page allocator APIs.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-13 19:26                     ` Dragos Tatulea
@ 2025-08-13 20:24                       ` Dragos Tatulea
  2025-08-14 11:26                         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 22+ messages in thread
From: Dragos Tatulea @ 2025-08-13 20:24 UTC (permalink / raw)
  To: Chris Arges
  Cc: Jesse Brandeburg, netdev, bpf, kernel-team,
	Jesper Dangaard Brouer, tariqt, saeedm, Leon Romanovsky,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Simon Horman, Andrew Rzeznik, Yan Zhai

On Wed, Aug 13, 2025 at 07:26:49PM +0000, Dragos Tatulea wrote:
> On Wed, Aug 13, 2025 at 01:53:48PM -0500, Chris Arges wrote:
> > On 2025-08-12 16:25:58, Chris Arges wrote:
> > > On 2025-08-12 20:19:30, Dragos Tatulea wrote:
> > > > On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
> > > > > On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
> > > > > 
> > > > > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > > > > > index 482d284a1553..484216c7454d 100644
> > > > > > --- a/kernel/bpf/devmap.c
> > > > > > +++ b/kernel/bpf/devmap.c
> > > > > > @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > > > > >          /* If not all frames have been transmitted, it is our
> > > > > >           * responsibility to free them
> > > > > >           */
> > > > > > +       xdp_set_return_frame_no_direct();
> > > > > >          for (i = sent; unlikely(i < to_send); i++)
> > > > > >                  xdp_return_frame_rx_napi(bq->q[i]);
> > > > > > +       xdp_clear_return_frame_no_direct();
> > > > > 
> > > > > Why can't this instead just be xdp_return_frame(bq->q[i]); with no
> > > > > "no_direct" fussing?
> > > > > 
> > > > > Wouldn't this be the safest way for this function to call frame completion?
> > > > > It seems like presuming the calling context is napi is wrong?
> > > > >
> > > > It would be better indeed. Thanks for removing my horse glasses!
> > > > 
> > > > Once Chris verifies that this works for him I can prepare a fix patch.
> > > >
> > > Working on that now, I'm testing a kernel with the following change:
> > > 
> > > ---
> > > 
> > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > > index 3aa002a47..ef86d9e06 100644
> > > --- a/kernel/bpf/devmap.c
> > > +++ b/kernel/bpf/devmap.c
> > > @@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > >          * responsibility to free them
> > >          */
> > >         for (i = sent; unlikely(i < to_send); i++)
> > > -               xdp_return_frame_rx_napi(bq->q[i]);
> > > +               xdp_return_frame(bq->q[i]);
> > >  
> > >  out:
> > >         bq->count = 0;
> > 
> > This patch resolves the issue I was seeing and I am no longer able to
> > reproduce the issue. I tested for about 2 hours, when the reproducer usually
> > takes about 1-2 minutes.
> >
> Thanks! Will send a patch tomorrow and also add you in the Tested-by tag.
> 
> As follow up work it would be good to have a way to catch this family of
> issues. Something in the lines of the patch below.
> 
> Thanks,
> Dragos
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index f1373756cd0f..0c498fbd8df6 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -794,6 +794,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>  {
>         lockdep_assert_no_hardirq();
>  
> +#ifdef CONFIG_PAGE_POOL_CACHEDEBUG
> +       WARN(page_pool_napi_local(pool), "Page pool cache access from non-direct napi context");
I meant to negate the condition here.
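
That is, the intended check (sketch only, with the condition negated as
described above) would be:

	WARN(!page_pool_napi_local(pool),
	     "Page pool cache access from non-direct napi context");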

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-13 20:24                       ` Dragos Tatulea
@ 2025-08-14 11:26                         ` Jesper Dangaard Brouer
  2025-08-14 14:42                           ` Dragos Tatulea
  0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-14 11:26 UTC (permalink / raw)
  To: Dragos Tatulea, Chris Arges
  Cc: Jesse Brandeburg, netdev, bpf, kernel-team, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai,
	Kumar Kartikeya Dwivedi

[-- Attachment #1: Type: text/plain, Size: 4328 bytes --]



On 13/08/2025 22.24, Dragos Tatulea wrote:
> On Wed, Aug 13, 2025 at 07:26:49PM +0000, Dragos Tatulea wrote:
>> On Wed, Aug 13, 2025 at 01:53:48PM -0500, Chris Arges wrote:
>>> On 2025-08-12 16:25:58, Chris Arges wrote:
>>>> On 2025-08-12 20:19:30, Dragos Tatulea wrote:
>>>>> On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
>>>>>> On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
>>>>>>
>>>>>>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>>>>>>> index 482d284a1553..484216c7454d 100644
>>>>>>> --- a/kernel/bpf/devmap.c
>>>>>>> +++ b/kernel/bpf/devmap.c
>>>>>>> @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>>>>           /* If not all frames have been transmitted, it is our
>>>>>>>            * responsibility to free them
>>>>>>>            */
>>>>>>> +       xdp_set_return_frame_no_direct();
>>>>>>>           for (i = sent; unlikely(i < to_send); i++)
>>>>>>>                   xdp_return_frame_rx_napi(bq->q[i]);
>>>>>>> +       xdp_clear_return_frame_no_direct();
>>>>>>
>>>>>> Why can't this instead just be xdp_return_frame(bq->q[i]); with no
>>>>>> "no_direct" fussing?
>>>>>>
>>>>>> Wouldn't this be the safest way for this function to call frame completion?
>>>>>> It seems like presuming the calling context is napi is wrong?
>>>>>>
>>>>> It would be better indeed. Thanks for removing my horse glasses!
>>>>>
>>>>> Once Chris verifies that this works for him I can prepare a fix patch.
>>>>>
>>>> Working on that now, I'm testing a kernel with the following change:
>>>>
>>>> ---
>>>>
>>>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>>>> index 3aa002a47..ef86d9e06 100644
>>>> --- a/kernel/bpf/devmap.c
>>>> +++ b/kernel/bpf/devmap.c
>>>> @@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>           * responsibility to free them
>>>>           */
>>>>          for (i = sent; unlikely(i < to_send); i++)
>>>> -               xdp_return_frame_rx_napi(bq->q[i]);
>>>> +               xdp_return_frame(bq->q[i]);
>>>>   
>>>>   out:
>>>>          bq->count = 0;
>>>
>>> This patch resolves the issue I was seeing and I am no longer able to
>>> reproduce the issue. I tested for about 2 hours, when the reproducer usually
>>> takes about 1-2 minutes.
>>>
>> Thanks! Will send a patch tomorrow and also add you in the Tested-by tag.
>>

Looking at the code ... there are more cases we need to deal with if we
simply replace xdp_return_frame_rx_napi() with xdp_return_frame().

The normal way to fix this is to use the helpers:
  - xdp_set_return_frame_no_direct();
  - xdp_clear_return_frame_no_direct()

Because the __xdp_return() code[1] checks xdp_return_frame_no_direct()
and will disable those napi_direct requests.

  [1] https://elixir.bootlin.com/linux/v6.16/source/net/core/xdp.c#L439

Something doesn't add up, because the remote CPUMAP bpf-prog that 
redirects to veth is running in cpu_map_bpf_prog_run_xdp()[2] and that 
function already uses the xdp_set_return_frame_no_direct() helper.

  [2] https://elixir.bootlin.com/linux/v6.16/source/kernel/bpf/cpumap.c#L189

I see the bug now... attached a patch with the fix.
The scope for the "no_direct" forgot to wrap the xdp_do_flush() call.

Looks like the bug was introduced in 11941f8a8536 ("bpf: cpumap: 
Implement generic cpumap"), in v5.15.

>> As follow up work it would be good to have a way to catch this family of
>> issues. Something in the lines of the patch below.
>>

Yes, please, we want something that can catch these kinds of 
hard-to-find bugs.

>> Thanks,
>> Dragos
>>
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index f1373756cd0f..0c498fbd8df6 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -794,6 +794,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>>   {
>>          lockdep_assert_no_hardirq();
>>   
>> +#ifdef CONFIG_PAGE_POOL_CACHEDEBUG
>> +       WARN(page_pool_napi_local(pool), "Page pool cache access from non-direct napi context");
> I meant to negate the condition here.
> 

The XDP code has evolved since the xdp_set_return_frame_no_direct()
calls were added.  Now page_pool keeps track of pp->napi and
pool->cpuid.  Maybe the __xdp_return() [1] checks should be updated?
(And maybe that would allow us to remove the no_direct helpers.)

--Jesper

[-- Attachment #2: 01-cpumap-disable-pp-direct.patch --]
[-- Type: text/x-patch, Size: 2101 bytes --]

cpumap: disabling page_pool direct xdp_return needs a larger scope

From: Jesper Dangaard Brouer <hawk@kernel.org>

When running an XDP bpf_prog on the remote CPU in the cpumap code,
we must disable the direct return optimization that xdp_return can
perform for mem_type page_pool.  This optimization assumes the code is
still executing under the RX-NAPI of the original receiving CPU, which
isn't true on this remote CPU.

The cpumap code already disabled this via helpers
xdp_set_return_frame_no_direct() and xdp_clear_return_frame_no_direct(),
but the scope didn't include xdp_do_flush().

When doing XDP_REDIRECT towards e.g. devmap, this causes the
function bq_xmit_all() to run with the direct return optimization
enabled.  This can lead to hard-to-find bugs.

Fix this by expanding the scope to include xdp_do_flush().

Fixes: 11941f8a8536 ("bpf: cpumap: Implement generic cpumap")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 kernel/bpf/cpumap.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index b2b7b8ec2c2a..c46360b27871 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -186,7 +186,6 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 	struct xdp_buff xdp;
 	int i, nframes = 0;
 
-	xdp_set_return_frame_no_direct();
 	xdp.rxq = &rxq;
 
 	for (i = 0; i < n; i++) {
@@ -231,7 +230,6 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 		}
 	}
 
-	xdp_clear_return_frame_no_direct();
 	stats->pass += nframes;
 
 	return nframes;
@@ -255,6 +253,7 @@ static void cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 
 	rcu_read_lock();
 	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+	xdp_set_return_frame_no_direct();
 
 	ret->xdp_n = cpu_map_bpf_prog_run_xdp(rcpu, frames, ret->xdp_n, stats);
 	if (unlikely(ret->skb_n))
@@ -264,6 +263,7 @@ static void cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (stats->redirect)
 		xdp_do_flush();
 
+	xdp_clear_return_frame_no_direct();
 	bpf_net_ctx_clear(bpf_net_ctx);
 	rcu_read_unlock();
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-14 11:26                         ` Jesper Dangaard Brouer
@ 2025-08-14 14:42                           ` Dragos Tatulea
  2025-08-14 15:58                             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 22+ messages in thread
From: Dragos Tatulea @ 2025-08-14 14:42 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Chris Arges
  Cc: Jesse Brandeburg, netdev, bpf, kernel-team, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Simon Horman, Andrew Rzeznik, Yan Zhai,
	Kumar Kartikeya Dwivedi

On Thu, Aug 14, 2025 at 01:26:37PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 13/08/2025 22.24, Dragos Tatulea wrote:
> > On Wed, Aug 13, 2025 at 07:26:49PM +0000, Dragos Tatulea wrote:
> > > On Wed, Aug 13, 2025 at 01:53:48PM -0500, Chris Arges wrote:
> > > > On 2025-08-12 16:25:58, Chris Arges wrote:
> > > > > On 2025-08-12 20:19:30, Dragos Tatulea wrote:
> > > > > > On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
> > > > > > > On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
> > > > > > > 
> > > > > > > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > > > > > > > index 482d284a1553..484216c7454d 100644
> > > > > > > > --- a/kernel/bpf/devmap.c
> > > > > > > > +++ b/kernel/bpf/devmap.c
> > > > > > > > @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > > > > > > >           /* If not all frames have been transmitted, it is our
> > > > > > > >            * responsibility to free them
> > > > > > > >            */
> > > > > > > > +       xdp_set_return_frame_no_direct();
> > > > > > > >           for (i = sent; unlikely(i < to_send); i++)
> > > > > > > >                   xdp_return_frame_rx_napi(bq->q[i]);
> > > > > > > > +       xdp_clear_return_frame_no_direct();
> > > > > > > 
> > > > > > > Why can't this instead just be xdp_return_frame(bq->q[i]); with no
> > > > > > > "no_direct" fussing?
> > > > > > > 
> > > > > > > Wouldn't this be the safest way for this function to call frame completion?
> > > > > > > It seems like presuming the calling context is napi is wrong?
> > > > > > > 
> > > > > > It would be better indeed. Thanks for removing my horse glasses!
> > > > > > 
> > > > > > Once Chris verifies that this works for him I can prepare a fix patch.
> > > > > > 
> > > > > Working on that now, I'm testing a kernel with the following change:
> > > > > 
> > > > > ---
> > > > > 
> > > > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > > > > index 3aa002a47..ef86d9e06 100644
> > > > > --- a/kernel/bpf/devmap.c
> > > > > +++ b/kernel/bpf/devmap.c
> > > > > @@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > > > >           * responsibility to free them
> > > > >           */
> > > > >          for (i = sent; unlikely(i < to_send); i++)
> > > > > -               xdp_return_frame_rx_napi(bq->q[i]);
> > > > > +               xdp_return_frame(bq->q[i]);
> > > > >   out:
> > > > >          bq->count = 0;
> > > > 
> > > > This patch resolves the issue I was seeing and I am no longer able to
> > > > reproduce the issue. I tested for about 2 hours, when the reproducer usually
> > > > takes about 1-2 minutes.
> > > > 
> > > Thanks! Will send a patch tomorrow and also add you in the Tested-by tag.
> > > 
> 
> Looking at code ... there are more cases we need to deal with.
> If simply replacing xdp_return_frame_rx_napi() with xdp_return_frame.
> 
> The normal way to fix this is to use the helpers:
>  - xdp_set_return_frame_no_direct();
>  - xdp_clear_return_frame_no_direct()
> 
> Because __xdp_return() code[1] via xdp_return_frame_no_direct() will
> disable those napi_direct requests.
> 
>  [1] https://elixir.bootlin.com/linux/v6.16/source/net/core/xdp.c#L439
>
> Something doesn't add-up, because the remote CPUMAP bpf-prog that redirects
> to veth is running in cpu_map_bpf_prog_run_xdp()[2] and that function
> already uses the xdp_set_return_frame_no_direct() helper.
> 
>  [2] https://elixir.bootlin.com/linux/v6.16/source/kernel/bpf/cpumap.c#L189
> 
> I see the bug now... attached a patch with the fix.
> The scope for the "no_direct" forgot to wrap the xdp_do_flush() call.
> 
> Looks like bug was introduced in 11941f8a8536 ("bpf: cpumap: Implement
> generic cpumap") v5.15.
>
Nice! Thanks for looking at this! Will you send the patch separately?

> > > As follow up work it would be good to have a way to catch this family of
> > > issues. Something in the lines of the patch below.
> > > 
> 
> Yes, please, we want something that can catch these kind of hard to find
> bugs.
>
Will send a patch when I find some time.

> > > Thanks,
> > > Dragos
> > > 
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index f1373756cd0f..0c498fbd8df6 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -794,6 +794,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
> > >   {
> > >          lockdep_assert_no_hardirq();
> > > +#ifdef CONFIG_PAGE_POOL_CACHEDEBUG
> > > +       WARN(page_pool_napi_local(pool), "Page pool cache access from non-direct napi context");
> > I meant to negate the condition here.
> > 
> 
> The XDP code have evolved since the xdp_set_return_frame_no_direct()
> calls were added.  Now page_pool keeps track of pp->napi and
> pool-> cpuid.  Maybe the __xdp_return [1] checks should be updated?
> (and maybe it allows us to remove the no_direct helpers).
> 
So you mean to drop the napi_direct flag in __xdp_return and let
page_pool_put_unrefed_netmem() decide if direct should be used by
page_pool_napi_local()?

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-14 14:42                           ` Dragos Tatulea
@ 2025-08-14 15:58                             ` Jesper Dangaard Brouer
  2025-08-14 16:45                               ` Dragos Tatulea
  2025-08-15 14:59                               ` Jakub Kicinski
  0 siblings, 2 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-14 15:58 UTC (permalink / raw)
  To: Dragos Tatulea, Chris Arges, Jakub Kicinski
  Cc: Jesse Brandeburg, netdev, bpf, kernel-team, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Simon Horman, Andrew Rzeznik, Yan Zhai, Kumar Kartikeya Dwivedi



On 14/08/2025 16.42, Dragos Tatulea wrote:
> On Thu, Aug 14, 2025 at 01:26:37PM +0200, Jesper Dangaard Brouer wrote:
>>
>>
>> On 13/08/2025 22.24, Dragos Tatulea wrote:
>>> On Wed, Aug 13, 2025 at 07:26:49PM +0000, Dragos Tatulea wrote:
>>>> On Wed, Aug 13, 2025 at 01:53:48PM -0500, Chris Arges wrote:
>>>>> On 2025-08-12 16:25:58, Chris Arges wrote:
>>>>>> On 2025-08-12 20:19:30, Dragos Tatulea wrote:
>>>>>>> On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
>>>>>>>> On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
>>>>>>>>
>>>>>>>>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>>>>>>>>> index 482d284a1553..484216c7454d 100644
>>>>>>>>> --- a/kernel/bpf/devmap.c
>>>>>>>>> +++ b/kernel/bpf/devmap.c
>>>>>>>>> @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>>>>>>            /* If not all frames have been transmitted, it is our
>>>>>>>>>             * responsibility to free them
>>>>>>>>>             */
>>>>>>>>> +       xdp_set_return_frame_no_direct();
>>>>>>>>>            for (i = sent; unlikely(i < to_send); i++)
>>>>>>>>>                    xdp_return_frame_rx_napi(bq->q[i]);
>>>>>>>>> +       xdp_clear_return_frame_no_direct();
>>>>>>>>
>>>>>>>> Why can't this instead just be xdp_return_frame(bq->q[i]); with no
>>>>>>>> "no_direct" fussing?
>>>>>>>>
>>>>>>>> Wouldn't this be the safest way for this function to call frame completion?
>>>>>>>> It seems like presuming the calling context is napi is wrong?
>>>>>>>>
>>>>>>> It would be better indeed. Thanks for removing my horse glasses!
>>>>>>>
>>>>>>> Once Chris verifies that this works for him I can prepare a fix patch.
>>>>>>>
>>>>>> Working on that now, I'm testing a kernel with the following change:
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>>>>>> index 3aa002a47..ef86d9e06 100644
>>>>>> --- a/kernel/bpf/devmap.c
>>>>>> +++ b/kernel/bpf/devmap.c
>>>>>> @@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>>>            * responsibility to free them
>>>>>>            */
>>>>>>           for (i = sent; unlikely(i < to_send); i++)
>>>>>> -               xdp_return_frame_rx_napi(bq->q[i]);
>>>>>> +               xdp_return_frame(bq->q[i]);
>>>>>>    out:
>>>>>>           bq->count = 0;
>>>>>
>>>>> This patch resolves the issue I was seeing and I am no longer able to
>>>>> reproduce the issue. I tested for about 2 hours, when the reproducer usually
>>>>> takes about 1-2 minutes.
>>>>>
>>>> Thanks! Will send a patch tomorrow and also add you in the Tested-by tag.
>>>>
>>
>> Looking at code ... there are more cases we need to deal with.
>> If simply replacing xdp_return_frame_rx_napi() with xdp_return_frame.
>>
>> The normal way to fix this is to use the helpers:
>>   - xdp_set_return_frame_no_direct();
>>   - xdp_clear_return_frame_no_direct()
>>
>> Because __xdp_return() code[1] via xdp_return_frame_no_direct() will
>> disable those napi_direct requests.
>>
>>   [1] https://elixir.bootlin.com/linux/v6.16/source/net/core/xdp.c#L439
>>
>> Something doesn't add-up, because the remote CPUMAP bpf-prog that redirects
>> to veth is running in cpu_map_bpf_prog_run_xdp()[2] and that function
>> already uses the xdp_set_return_frame_no_direct() helper.
>>
>>   [2] https://elixir.bootlin.com/linux/v6.16/source/kernel/bpf/cpumap.c#L189
>>
>> I see the bug now... attached a patch with the fix.
>> The scope for the "no_direct" forgot to wrap the xdp_do_flush() call.
>>
>> Looks like bug was introduced in 11941f8a8536 ("bpf: cpumap: Implement
>> generic cpumap") v5.15.
>>
> Nice! Thanks for looking at this! Will you send the patch separately?
> 

Yes, I will send the patch as an official patch.

I want to give both of you credit, so I'm considering adding these tags
to the patch description (WDYT):

Found-by: Dragos Tatulea <dtatulea@nvidia.com>
Reported-by: Chris Arges <carges@cloudflare.com>


>>>> As follow up work it would be good to have a way to catch this family of
>>>> issues. Something in the lines of the patch below.
>>>>
>>
>> Yes, please, we want something that can catch these kind of hard to find
>> bugs.
>>
> Will send a patch when I find some time.
>

Great! :-)

>>>> Thanks,
>>>> Dragos
>>>>
>>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>>> index f1373756cd0f..0c498fbd8df6 100644
>>>> --- a/net/core/page_pool.c
>>>> +++ b/net/core/page_pool.c
>>>> @@ -794,6 +794,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>>>>    {
>>>>           lockdep_assert_no_hardirq();
>>>> +#ifdef CONFIG_PAGE_POOL_CACHEDEBUG
>>>> +       WARN(page_pool_napi_local(pool), "Page pool cache access from non-direct napi context");
>>> I meant to negate the condition here.
>>>
>>
>> The XDP code have evolved since the xdp_set_return_frame_no_direct()
>> calls were added.  Now page_pool keeps track of pp->napi and
>> pool-> cpuid.  Maybe the __xdp_return [1] checks should be updated?
>> (and maybe it allows us to remove the no_direct helpers).
>>
> So you mean to drop the napi_direct flag in __xdp_return and let
> page_pool_put_unrefed_netmem() decide if direct should be used by
> page_pool_napi_local()?

Yes, something like that, but I would like Kuba/Jakub's input, as IIRC
he introduced the page_pool->cpuid and page_pool->napi.

There are some corner cases we need to consider.  If cpumap gets
redirected to the *same* CPU as the "previous" NAPI instance,
which then makes page_pool->cpuid match, is it then still valid to do
a "direct" return?

--Jesper

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-14 15:58                             ` Jesper Dangaard Brouer
@ 2025-08-14 16:45                               ` Dragos Tatulea
  2025-08-15 14:59                               ` Jakub Kicinski
  1 sibling, 0 replies; 22+ messages in thread
From: Dragos Tatulea @ 2025-08-14 16:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Chris Arges, Jakub Kicinski
  Cc: Jesse Brandeburg, netdev, bpf, kernel-team, tariqt, saeedm,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Simon Horman, Andrew Rzeznik, Yan Zhai, Kumar Kartikeya Dwivedi

On Thu, Aug 14, 2025 at 05:58:21PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 14/08/2025 16.42, Dragos Tatulea wrote:
> > On Thu, Aug 14, 2025 at 01:26:37PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > 
> > > On 13/08/2025 22.24, Dragos Tatulea wrote:
> > > > On Wed, Aug 13, 2025 at 07:26:49PM +0000, Dragos Tatulea wrote:
> > > > > On Wed, Aug 13, 2025 at 01:53:48PM -0500, Chris Arges wrote:
> > > > > > On 2025-08-12 16:25:58, Chris Arges wrote:
> > > > > > > On 2025-08-12 20:19:30, Dragos Tatulea wrote:
> > > > > > > > On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
> > > > > > > > > On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
> > > > > > > > > 
> > > > > > > > > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > > > > > > > > > index 482d284a1553..484216c7454d 100644
> > > > > > > > > > --- a/kernel/bpf/devmap.c
> > > > > > > > > > +++ b/kernel/bpf/devmap.c
> > > > > > > > > > @@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > > > > > > > > >            /* If not all frames have been transmitted, it is our
> > > > > > > > > >             * responsibility to free them
> > > > > > > > > >             */
> > > > > > > > > > +       xdp_set_return_frame_no_direct();
> > > > > > > > > >            for (i = sent; unlikely(i < to_send); i++)
> > > > > > > > > >                    xdp_return_frame_rx_napi(bq->q[i]);
> > > > > > > > > > +       xdp_clear_return_frame_no_direct();
> > > > > > > > > 
> > > > > > > > > Why can't this instead just be xdp_return_frame(bq->q[i]); with no
> > > > > > > > > "no_direct" fussing?
> > > > > > > > > 
> > > > > > > > > Wouldn't this be the safest way for this function to call frame completion?
> > > > > > > > > It seems like presuming the calling context is napi is wrong?
> > > > > > > > > 
> > > > > > > > It would be better indeed. Thanks for removing my horse glasses!
> > > > > > > > 
> > > > > > > > Once Chris verifies that this works for him I can prepare a fix patch.
> > > > > > > > 
> > > > > > > Working on that now, I'm testing a kernel with the following change:
> > > > > > > 
> > > > > > > ---
> > > > > > > 
> > > > > > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> > > > > > > index 3aa002a47..ef86d9e06 100644
> > > > > > > --- a/kernel/bpf/devmap.c
> > > > > > > +++ b/kernel/bpf/devmap.c
> > > > > > > @@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > > > > > >            * responsibility to free them
> > > > > > >            */
> > > > > > >           for (i = sent; unlikely(i < to_send); i++)
> > > > > > > -               xdp_return_frame_rx_napi(bq->q[i]);
> > > > > > > +               xdp_return_frame(bq->q[i]);
> > > > > > >    out:
> > > > > > >           bq->count = 0;
> > > > > > 
> > > > > > This patch resolves the issue I was seeing and I am no longer able to
> > > > > > reproduce the issue. I tested for about 2 hours, when the reproducer usually
> > > > > > takes about 1-2 minutes.
> > > > > > 
> > > > > Thanks! Will send a patch tomorrow and also add you in the Tested-by tag.
> > > > > 
> > > 
> > > Looking at code ... there are more cases we need to deal with.
> > > If simply replacing xdp_return_frame_rx_napi() with xdp_return_frame.
> > > 
> > > The normal way to fix this is to use the helpers:
> > >   - xdp_set_return_frame_no_direct();
> > >   - xdp_clear_return_frame_no_direct()
> > > 
> > > Because __xdp_return() code[1] via xdp_return_frame_no_direct() will
> > > disable those napi_direct requests.
> > > 
> > >   [1] https://elixir.bootlin.com/linux/v6.16/source/net/core/xdp.c#L439
> > > 
> > > Something doesn't add-up, because the remote CPUMAP bpf-prog that redirects
> > > to veth is running in cpu_map_bpf_prog_run_xdp()[2] and that function
> > > already uses the xdp_set_return_frame_no_direct() helper.
> > > 
> > >   [2] https://elixir.bootlin.com/linux/v6.16/source/kernel/bpf/cpumap.c#L189
> > > 
> > > I see the bug now... attached a patch with the fix.
> > > The scope for the "no_direct" forgot to wrap the xdp_do_flush() call.
> > > 
> > > Looks like bug was introduced in 11941f8a8536 ("bpf: cpumap: Implement
> > > generic cpumap") v5.15.
> > > 
> > Nice! Thanks for looking at this! Will you send the patch separately?
> > 
> 
> Yes, I will send the patch as an official patch.
> 
> I want to give both of you credit, so I'm considering adding these tags
> to the patch description (WDYT):
> 
> Found-by: Dragos Tatulea <dtatulea@nvidia.com>
> Reported-by: Chris Arges <carges@cloudflare.com>
>
Sure. Much appreciated.

> 
> > > > > As follow up work it would be good to have a way to catch this family of
> > > > > issues. Something in the lines of the patch below.
> > > > > 
> > > 
> > > Yes, please, we want something that can catch these kind of hard to find
> > > bugs.
> > > 
> > Will send a patch when I find some time.
> > 
> 
> Great! :-)
> 
> > > > > Thanks,
> > > > > Dragos
> > > > > 
> > > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > > > index f1373756cd0f..0c498fbd8df6 100644
> > > > > --- a/net/core/page_pool.c
> > > > > +++ b/net/core/page_pool.c
> > > > > @@ -794,6 +794,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
> > > > >    {
> > > > >           lockdep_assert_no_hardirq();
> > > > > +#ifdef CONFIG_PAGE_POOL_CACHEDEBUG
> > > > > +       WARN(page_pool_napi_local(pool), "Page pool cache access from non-direct napi context");
> > > > I meant to negate the condition here.
> > > > 
> > > 
> > > The XDP code have evolved since the xdp_set_return_frame_no_direct()
> > > calls were added.  Now page_pool keeps track of pp->napi and
> > > pool-> cpuid.  Maybe the __xdp_return [1] checks should be updated?
> > > (and maybe it allows us to remove the no_direct helpers).
> > > 
> > So you mean to drop the napi_direct flag in __xdp_return and let
> > page_pool_put_unrefed_netmem() decide if direct should be used by
> > page_pool_napi_local()?
> 
> Yes, something like that, but I would like Kuba/Jakub's input, as IIRC
> he introduced the page_pool->cpuid and page_pool->napi.
> 
> There are some corner-cases we need to consider if they are valid.  If
> cpumap get redirected to the *same* CPU as "previous" NAPI instance,
> which then makes page_pool->cpuid match, is it then still valid to do
> "direct" return(?).
Understood. Let's see.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-14 15:58                             ` Jesper Dangaard Brouer
  2025-08-14 16:45                               ` Dragos Tatulea
@ 2025-08-15 14:59                               ` Jakub Kicinski
  2025-08-15 16:02                                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2025-08-15 14:59 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Dragos Tatulea, Chris Arges, Jesse Brandeburg, netdev, bpf,
	kernel-team, tariqt, saeedm, Leon Romanovsky, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Simon Horman, Andrew Rzeznik,
	Yan Zhai, Kumar Kartikeya Dwivedi

On Thu, 14 Aug 2025 17:58:21 +0200 Jesper Dangaard Brouer wrote:
> Found-by: Dragos Tatulea <dtatulea@nvidia.com>

ENOSUCHTAG?

> Reported-by: Chris Arges <carges@cloudflare.com>

> >> The XDP code have evolved since the xdp_set_return_frame_no_direct()
> >> calls were added.  Now page_pool keeps track of pp->napi and
> >> pool-> cpuid.  Maybe the __xdp_return [1] checks should be updated?
> >> (and maybe it allows us to remove the no_direct helpers).
> >>  
> > So you mean to drop the napi_direct flag in __xdp_return and let
> > page_pool_put_unrefed_netmem() decide if direct should be used by
> > page_pool_napi_local()?  
> 
> Yes, something like that, but I would like Kuba/Jakub's input, as IIRC
> he introduced the page_pool->cpuid and page_pool->napi.
> 
> There are some corner-cases we need to consider if they are valid.  If
> cpumap get redirected to the *same* CPU as "previous" NAPI instance,
> which then makes page_pool->cpuid match, is it then still valid to do
> "direct" return(?).

I think/hope so, but it depends on xdp_return only being called from
softirq context. Since softirqs can't nest, if the producer and consumer 
of the page pool pages are on the same CPU they can't race.
I'm slightly worried that drivers which don't have dedicated Tx XDP
rings will clean it up from hard IRQ when netpoll calls. But that'd
be a bug, right? We don't allow XDP processing from IRQ context.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-15 14:59                               ` Jakub Kicinski
@ 2025-08-15 16:02                                 ` Jesper Dangaard Brouer
  2025-08-15 16:36                                   ` Jakub Kicinski
  0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-15 16:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Dragos Tatulea, Chris Arges, Jesse Brandeburg, netdev, bpf,
	kernel-team, tariqt, saeedm, Leon Romanovsky, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Simon Horman, Andrew Rzeznik,
	Yan Zhai, Kumar Kartikeya Dwivedi, Mina Almasry



On 15/08/2025 16.59, Jakub Kicinski wrote:
> On Thu, 14 Aug 2025 17:58:21 +0200 Jesper Dangaard Brouer wrote:
>> Found-by: Dragos Tatulea <dtatulea@nvidia.com>
> 
> ENOSUCHTAG?
>

I pre-checked that "Found-by:" has already been used 32 times in git
history. But don't worry, Martin applied it such that it isn't in the
tags section, by removing the ":" and placing it in the desc part.

>> Reported-by: Chris Arges <carges@cloudflare.com>
> 
>>>> The XDP code have evolved since the xdp_set_return_frame_no_direct()
>>>> calls were added.  Now page_pool keeps track of pp->napi and
>>>> pool-> cpuid.  Maybe the __xdp_return [1] checks should be updated?
>>>> (and maybe it allows us to remove the no_direct helpers).
>>>>   
>>> So you mean to drop the napi_direct flag in __xdp_return and let
>>> page_pool_put_unrefed_netmem() decide if direct should be used by
>>> page_pool_napi_local()?
>>
>> Yes, something like that, but I would like Kuba/Jakub's input, as IIRC
>> he introduced the page_pool->cpuid and page_pool->napi.
>>
>> There are some corner-cases we need to consider if they are valid.  If
>> cpumap get redirected to the *same* CPU as "previous" NAPI instance,
>> which then makes page_pool->cpuid match, is it then still valid to do
>> "direct" return(?).
> 
> I think/hope so, but it depends on xdp_return only being called from
> softirq context.. Since softirqs can't nest if producer and consumer
> of the page pool pages are on the same CPU they can't race.

That is true, softirqs can't nest.

Jesse pointed me at the tun device driver, where we are in principle 
missing an xdp_set_return_frame_no_direct() section. Except I believe 
that the memory type cannot be page_pool in this driver. (Code hint: 
tun_xdp_act() calls xdp_do_redirect.)

The tun driver made me realize that we do have users that don't run 
under a softirq, but they do remember to disable BH. (IIRC BH-disable 
can nest.)  Are we also race-safe in this case?
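
For the record, the kind of caller meant here looks roughly like this
(a sketch, not the actual tun code; the variable names are illustrative):
XDP runs from process context with BHs disabled rather than from the
owning RX NAPI softirq.

	u32 act;

	local_bh_disable();               /* BHs disabled, but no owning RX NAPI */
	act = bpf_prog_run_xdp(xdp_prog, &xdp);
	if (act == XDP_REDIRECT) {
		xdp_do_redirect(dev, &xdp, xdp_prog);
		xdp_do_flush();           /* frame returns can happen here, too */
	}
	local_bh_enable();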

Is the code change as simple as below or did I miss something?

void __xdp_return
   [...]
   case MEM_TYPE_PAGE_POOL:
    [...]
     if (napi_direct && READ_ONCE(pool->cpuid) != smp_processor_id())
	napi_direct = false;

It is true that when we exit NAPI, pool->cpuid becomes -1.
Or was that only during shutdown?

> I'm slightly worried that drivers which don't have dedicated Tx XDP
> rings will clean it up from hard IRQ when netpoll calls. But that'd
> be a bug, right? We don't allow XDP processing from IRQ context.

I didn't consider this code path. But yes, that would be considered a
netpoll bug IMHO.

--Jesper

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUG] mlx5_core memory management issue
  2025-08-15 16:02                                 ` Jesper Dangaard Brouer
@ 2025-08-15 16:36                                   ` Jakub Kicinski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2025-08-15 16:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Dragos Tatulea, Chris Arges, Jesse Brandeburg, netdev, bpf,
	kernel-team, tariqt, saeedm, Leon Romanovsky, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Simon Horman, Andrew Rzeznik,
	Yan Zhai, Kumar Kartikeya Dwivedi, Mina Almasry

On Fri, 15 Aug 2025 18:02:10 +0200 Jesper Dangaard Brouer wrote:
> >> Yes, something like that, but I would like Kuba/Jakub's input, as IIRC
> >> he introduced the page_pool->cpuid and page_pool->napi.
> >>
> >> There are some corner-cases we need to consider if they are valid.  If
> >> cpumap get redirected to the *same* CPU as "previous" NAPI instance,
> >> which then makes page_pool->cpuid match, is it then still valid to do
> >> "direct" return(?).  
> > 
> > I think/hope so, but it depends on xdp_return only being called from
> > softirq context.. Since softirqs can't nest if producer and consumer
> > of the page pool pages are on the same CPU they can't race.  
> 
> That is true, softirqs can't nest.
> 
> Jesse pointed me at the tun device driver, where we in-principle are 
> missing a xdp_set_return_frame_no_direct() section. Except I believe, 
> that the memory type cannot be page_pool in this driver. (Code hint, 
> tun_xdp_act() calls xdp_do_redirect).
> 
> The tun driver made me realize, that we do have users that doesn't run 
> under a softirq, but they do remember to disable BH. (IIRC BH-disable 
> can nest).  Are we also race safe in this case(?).

Yes, it should be. But chances of direct recycling happening in this
case are rather low, since NAPI needs to be pending to be considered
owned. If we're coming from process context, BHs are likely not pending.

> Is the code change as simple as below or did I miss something?
> 
> void __xdp_return
>    [...]
>    case MEM_TYPE_PAGE_POOL:
>     [...]
>      if (napi_direct && READ_ONCE(pool->cpuid) != smp_processor_id())
> 	napi_direct = false;

cpuid is a different beast; the NAPI-based direct recycling logic is in
page_pool_napi_local() (and we should not let it leak out to XDP:
just unref the page and PP will "override" the "napi_safe" argument).
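
For context, the check referred to here looks roughly like this (a
paraphrased sketch of page_pool_napi_local(), not a verbatim copy of the
kernel source):

	static bool page_pool_napi_local(const struct page_pool *pool)
	{
		const struct napi_struct *napi;
		u32 cpuid;

		/* Direct recycling is only safe from softirq context ... */
		if (!in_softirq())
			return false;

		/* ... and only on the CPU that owns the pool or is
		 * currently running the pool's NAPI instance.
		 */
		cpuid = smp_processor_id();
		if (READ_ONCE(pool->cpuid) == cpuid)
			return true;

		napi = READ_ONCE(pool->p.napi);
		return napi && READ_ONCE(napi->list_owner) == cpuid;
	}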

> It is true, that when we exit NAPI, then pool->cpuid becomes -1.
> Or what that only during shutdown?

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2025-07-03 15:49 [BUG] mlx5_core memory management issue Chris Arges
2025-07-04 12:37 ` Dragos Tatulea
2025-07-04 20:14   ` Dragos Tatulea
2025-07-07 22:07     ` Chris Arges
2025-07-23 18:48   ` Chris Arges
2025-07-24 17:01     ` Dragos Tatulea
2025-08-07 16:45       ` Chris Arges
2025-08-11  8:37         ` Dragos Tatulea
2025-08-12 15:44           ` Dragos Tatulea
2025-08-12 18:55             ` Jesse Brandeburg
2025-08-12 20:19               ` Dragos Tatulea
2025-08-12 21:25                 ` Chris Arges
2025-08-13 18:53                   ` Chris Arges
2025-08-13 19:26                     ` Dragos Tatulea
2025-08-13 20:24                       ` Dragos Tatulea
2025-08-14 11:26                         ` Jesper Dangaard Brouer
2025-08-14 14:42                           ` Dragos Tatulea
2025-08-14 15:58                             ` Jesper Dangaard Brouer
2025-08-14 16:45                               ` Dragos Tatulea
2025-08-15 14:59                               ` Jakub Kicinski
2025-08-15 16:02                                 ` Jesper Dangaard Brouer
2025-08-15 16:36                                   ` Jakub Kicinski
