* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
@ 2025-11-07 6:46 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
` (4 subsequent siblings)
5 siblings, 0 replies; 37+ messages in thread
From: Kuniyuki Iwashima @ 2025-11-07 6:46 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, netdev, eric.dumazet
On Thu, Nov 6, 2025 at 12:29 PM Eric Dumazet <edumazet@google.com> wrote:
>
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts :
>
> Before:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
>
> 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> 2.11% swapper [kernel.kallsyms] [k] __slab_free
> 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> 0.93% swapper [kernel.kallsyms] [k] read_tsc
> 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> 0.48% swapper [kernel.kallsyms] [k] sock_wfree
>
> After:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
>
> 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> 2.73% swapper [kernel.kallsyms] [k] read_tsc
> 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> 0.53% swapper [kernel.kallsyms] [k] io_idle
> 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> 0.32% swapper [kernel.kallsyms] [k] dql_completed
> 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> 0.28% swapper [kernel.kallsyms] [k] ktime_get
> 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Big improvement again!
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
2025-11-07 6:46 ` Kuniyuki Iwashima
@ 2025-11-07 11:23 ` Toke Høiland-Jørgensen
2025-11-07 12:28 ` Eric Dumazet
2025-11-07 15:44 ` Jason Xing
` (3 subsequent siblings)
5 siblings, 1 reply; 37+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 11:23 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts :
>
> Before:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
>
> 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> 2.11% swapper [kernel.kallsyms] [k] __slab_free
> 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> 0.93% swapper [kernel.kallsyms] [k] read_tsc
> 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> 0.48% swapper [kernel.kallsyms] [k] sock_wfree
>
> After:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
>
> 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> 2.73% swapper [kernel.kallsyms] [k] read_tsc
> 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> 0.53% swapper [kernel.kallsyms] [k] io_idle
> 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> 0.32% swapper [kernel.kallsyms] [k] dql_completed
> 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> 0.28% swapper [kernel.kallsyms] [k] ktime_get
> 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Impressive!
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-07 11:23 ` Toke Høiland-Jørgensen
@ 2025-11-07 12:28 ` Eric Dumazet
2025-11-07 14:34 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2025-11-07 12:28 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 3:23 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> Impressive!
>
> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Thanks!
Note that my upcoming plan is also to plumb skb_attempt_defer_free()
into __kfree_skb().
[ Part of a refactor, putting skb_unref() in skb_attempt_defer_free() ]
TCP under pressure would benefit from this a _lot_.
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-07 12:28 ` Eric Dumazet
@ 2025-11-07 14:34 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 37+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 14:34 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
Eric Dumazet <edumazet@google.com> writes:
> On Fri, Nov 7, 2025 at 3:23 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
>> Impressive!
>>
>> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
>
> Thanks !
>
> Note that my upcoming plan is also to plumb skb_attempt_defer_free()
> into __kfree_skb().
>
> [ Part of a refactor, putting skb_unref() in skb_attempt_defer_free() ]
>
> TCP under pressure would benefit from this a _lot_.
Interesting; look forward to seeing the results :)
-Toke
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
2025-11-07 6:46 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
@ 2025-11-07 15:44 ` Jason Xing
2025-11-11 16:56 ` Aditya Garg
` (2 subsequent siblings)
5 siblings, 0 replies; 37+ messages in thread
From: Jason Xing @ 2025-11-07 15:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
>
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts :
>
> Before:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
>
> 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> 2.11% swapper [kernel.kallsyms] [k] __slab_free
> 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> 0.93% swapper [kernel.kallsyms] [k] read_tsc
> 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> 0.48% swapper [kernel.kallsyms] [k] sock_wfree
>
> After:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
>
> 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> 2.73% swapper [kernel.kallsyms] [k] read_tsc
> 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> 0.53% swapper [kernel.kallsyms] [k] io_idle
> 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> 0.32% swapper [kernel.kallsyms] [k] dql_completed
> 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> 0.28% swapper [kernel.kallsyms] [k] ktime_get
> 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Thanks for your brilliant work, which has really inspired me once again.
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Thanks,
Jason
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
` (2 preceding siblings ...)
2025-11-07 15:44 ` Jason Xing
@ 2025-11-11 16:56 ` Aditya Garg
2025-11-11 17:17 ` Eric Dumazet
2025-11-12 14:03 ` Jon Hunter
2026-03-09 21:00 ` Long delay in freeing skbs Tony Battersby
5 siblings, 1 reply; 37+ messages in thread
From: Aditya Garg @ 2025-11-11 16:56 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
ssengar, gargaditya
On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts :
>
> Before:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
>
> 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> 2.11% swapper [kernel.kallsyms] [k] __slab_free
> 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> 0.93% swapper [kernel.kallsyms] [k] read_tsc
> 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> 0.48% swapper [kernel.kallsyms] [k] sock_wfree
>
> After:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
>
> 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> 2.73% swapper [kernel.kallsyms] [k] read_tsc
> 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> 0.53% swapper [kernel.kallsyms] [k] io_idle
> 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> 0.32% swapper [kernel.kallsyms] [k] dql_completed
> 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> 0.28% swapper [kernel.kallsyms] [k] ktime_get
> 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/skbuff.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>
> DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>
> + if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> + skb_release_head_state(skb);
> + return skb_attempt_defer_free(skb);
> + }
> +
> if (!skb_unref(skb))
> return;
>
> --
> 2.51.2.1041.gc1ab5b90ca-goog
>
I ran these tests on the latest net-next with the MANA driver, and I am observing a regression.
lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
Connecting to host 10.0.0.4, port 5201
[ 5] local 10.0.0.5 port 48692 connected to 10.0.0.4 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 4.57 GBytes 39.2 Gbits/sec 586 1.04 MBytes
[ 5] 1.00-2.00 sec 4.74 GBytes 40.7 Gbits/sec 520 1.13 MBytes
[ 5] 2.00-3.00 sec 5.16 GBytes 44.3 Gbits/sec 191 1.20 MBytes
[ 5] 3.00-4.00 sec 5.13 GBytes 44.1 Gbits/sec 520 1.11 MBytes
[ 5] 4.00-5.00 sec 678 MBytes 5.69 Gbits/sec 93 1.37 KBytes
[ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 9.00-10.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 10.00-11.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 11.00-12.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 12.00-13.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 13.00-14.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 14.00-15.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 15.00-16.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 16.00-17.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 17.00-18.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 18.00-19.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 19.00-20.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 20.00-21.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 21.00-22.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 22.00-23.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 23.00-24.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 24.00-25.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 25.00-26.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 26.00-27.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 27.00-28.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 28.00-29.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 29.00-30.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 20.3 GBytes 5.80 Gbits/sec 1910 sender
[ 5] 0.00-30.00 sec 20.3 GBytes 5.80 Gbits/sec receiver
iperf Done.
I tested again with this patch reverted, and the regression was gone.
lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
Connecting to host 10.0.0.4, port 5201
[ 5] local 10.0.0.5 port 58188 connected to 10.0.0.4 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 4.95 GBytes 42.5 Gbits/sec 541 1.10 MBytes
[ 5] 1.00-2.00 sec 4.92 GBytes 42.3 Gbits/sec 599 878 KBytes
[ 5] 2.00-3.00 sec 4.51 GBytes 38.7 Gbits/sec 438 803 KBytes
[ 5] 3.00-4.00 sec 4.69 GBytes 40.3 Gbits/sec 647 1.17 MBytes
[ 5] 4.00-5.00 sec 4.18 GBytes 35.9 Gbits/sec 1183 715 KBytes
[ 5] 5.00-6.00 sec 5.05 GBytes 43.4 Gbits/sec 484 975 KBytes
[ 5] 6.00-7.00 sec 5.32 GBytes 45.7 Gbits/sec 520 836 KBytes
[ 5] 7.00-8.00 sec 5.29 GBytes 45.5 Gbits/sec 436 1.10 MBytes
[ 5] 8.00-9.00 sec 5.27 GBytes 45.2 Gbits/sec 464 1.30 MBytes
[ 5] 9.00-10.00 sec 5.25 GBytes 45.1 Gbits/sec 425 1.13 MBytes
[ 5] 10.00-11.00 sec 5.29 GBytes 45.4 Gbits/sec 268 1.19 MBytes
[ 5] 11.00-12.00 sec 4.98 GBytes 42.8 Gbits/sec 711 793 KBytes
[ 5] 12.00-13.00 sec 3.80 GBytes 32.6 Gbits/sec 1255 801 KBytes
[ 5] 13.00-14.00 sec 3.80 GBytes 32.7 Gbits/sec 1130 642 KBytes
[ 5] 14.00-15.00 sec 4.31 GBytes 37.0 Gbits/sec 1024 1.11 MBytes
[ 5] 15.00-16.00 sec 5.18 GBytes 44.5 Gbits/sec 359 1.25 MBytes
[ 5] 16.00-17.00 sec 5.23 GBytes 44.9 Gbits/sec 265 900 KBytes
[ 5] 17.00-18.00 sec 4.70 GBytes 40.4 Gbits/sec 769 715 KBytes
[ 5] 18.00-19.00 sec 3.77 GBytes 32.4 Gbits/sec 1841 889 KBytes
[ 5] 19.00-20.00 sec 3.77 GBytes 32.4 Gbits/sec 1084 827 KBytes
[ 5] 20.00-21.00 sec 5.01 GBytes 43.0 Gbits/sec 558 994 KBytes
[ 5] 21.00-22.00 sec 5.27 GBytes 45.3 Gbits/sec 450 1.25 MBytes
[ 5] 22.00-23.00 sec 5.25 GBytes 45.1 Gbits/sec 338 1.18 MBytes
[ 5] 23.00-24.00 sec 5.29 GBytes 45.4 Gbits/sec 200 1.14 MBytes
[ 5] 24.00-25.00 sec 5.29 GBytes 45.5 Gbits/sec 518 1.02 MBytes
[ 5] 25.00-26.00 sec 4.28 GBytes 36.7 Gbits/sec 1258 792 KBytes
[ 5] 26.00-27.00 sec 3.87 GBytes 33.2 Gbits/sec 1365 799 KBytes
[ 5] 27.00-28.00 sec 4.77 GBytes 41.0 Gbits/sec 530 1.09 MBytes
[ 5] 28.00-29.00 sec 5.31 GBytes 45.6 Gbits/sec 419 1.06 MBytes
[ 5] 29.00-30.00 sec 5.32 GBytes 45.7 Gbits/sec 222 1.10 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 144 GBytes 41.2 Gbits/sec 20301 sender
[ 5] 0.00-30.00 sec 144 GBytes 41.2 Gbits/sec receiver
iperf Done.
I am still working through the technical details of this patch, but wanted to share these initial findings for your input. Please let me know your thoughts!
Regards,
Aditya
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-11 16:56 ` Aditya Garg
@ 2025-11-11 17:17 ` Eric Dumazet
2025-11-12 5:14 ` Aditya Garg
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2025-11-11 17:17 UTC (permalink / raw)
To: Aditya Garg
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
ssengar, gargaditya
On Tue, Nov 11, 2025 at 8:56 AM Aditya Garg
<gargaditya@linux.microsoft.com> wrote:
>
> On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
> > [patch changelog, perf profiles, and diff trimmed]
>
> I ran these tests on latest net-next with the MANA driver and I am observing a regression.
>
> lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
> Connecting to host 10.0.0.4, port 5201
> [ 5] local 10.0.0.5 port 48692 connected to 10.0.0.4 port 5201
> [ ID] Interval Transfer Bitrate Retr Cwnd
> [ 5] 0.00-1.00 sec 4.57 GBytes 39.2 Gbits/sec 586 1.04 MBytes
> [ 5] 1.00-2.00 sec 4.74 GBytes 40.7 Gbits/sec 520 1.13 MBytes
> [ 5] 2.00-3.00 sec 5.16 GBytes 44.3 Gbits/sec 191 1.20 MBytes
> [ 5] 3.00-4.00 sec 5.13 GBytes 44.1 Gbits/sec 520 1.11 MBytes
> [ 5] 4.00-5.00 sec 678 MBytes 5.69 Gbits/sec 93 1.37 KBytes
> [ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 9.00-10.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 10.00-11.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 11.00-12.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 12.00-13.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 13.00-14.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 14.00-15.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 15.00-16.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 16.00-17.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 17.00-18.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 18.00-19.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 19.00-20.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 20.00-21.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 21.00-22.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 22.00-23.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 23.00-24.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 24.00-25.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 25.00-26.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 26.00-27.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 27.00-28.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 28.00-29.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> [ 5] 29.00-30.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate Retr
> [ 5] 0.00-30.00 sec 20.3 GBytes 5.80 Gbits/sec 1910 sender
> [ 5] 0.00-30.00 sec 20.3 GBytes 5.80 Gbits/sec receiver
>
> iperf Done.
>
>
> I tested again by reverting this patch and the regression was not there.
>
> lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
> Connecting to host 10.0.0.4, port 5201
> [ 5] local 10.0.0.5 port 58188 connected to 10.0.0.4 port 5201
> [ ID] Interval Transfer Bitrate Retr Cwnd
> [ 5] 0.00-1.00 sec 4.95 GBytes 42.5 Gbits/sec 541 1.10 MBytes
> [ 5] 1.00-2.00 sec 4.92 GBytes 42.3 Gbits/sec 599 878 KBytes
> [ 5] 2.00-3.00 sec 4.51 GBytes 38.7 Gbits/sec 438 803 KBytes
> [ 5] 3.00-4.00 sec 4.69 GBytes 40.3 Gbits/sec 647 1.17 MBytes
> [ 5] 4.00-5.00 sec 4.18 GBytes 35.9 Gbits/sec 1183 715 KBytes
> [ 5] 5.00-6.00 sec 5.05 GBytes 43.4 Gbits/sec 484 975 KBytes
> [ 5] 6.00-7.00 sec 5.32 GBytes 45.7 Gbits/sec 520 836 KBytes
> [ 5] 7.00-8.00 sec 5.29 GBytes 45.5 Gbits/sec 436 1.10 MBytes
> [ 5] 8.00-9.00 sec 5.27 GBytes 45.2 Gbits/sec 464 1.30 MBytes
> [ 5] 9.00-10.00 sec 5.25 GBytes 45.1 Gbits/sec 425 1.13 MBytes
> [ 5] 10.00-11.00 sec 5.29 GBytes 45.4 Gbits/sec 268 1.19 MBytes
> [ 5] 11.00-12.00 sec 4.98 GBytes 42.8 Gbits/sec 711 793 KBytes
> [ 5] 12.00-13.00 sec 3.80 GBytes 32.6 Gbits/sec 1255 801 KBytes
> [ 5] 13.00-14.00 sec 3.80 GBytes 32.7 Gbits/sec 1130 642 KBytes
> [ 5] 14.00-15.00 sec 4.31 GBytes 37.0 Gbits/sec 1024 1.11 MBytes
> [ 5] 15.00-16.00 sec 5.18 GBytes 44.5 Gbits/sec 359 1.25 MBytes
> [ 5] 16.00-17.00 sec 5.23 GBytes 44.9 Gbits/sec 265 900 KBytes
> [ 5] 17.00-18.00 sec 4.70 GBytes 40.4 Gbits/sec 769 715 KBytes
> [ 5] 18.00-19.00 sec 3.77 GBytes 32.4 Gbits/sec 1841 889 KBytes
> [ 5] 19.00-20.00 sec 3.77 GBytes 32.4 Gbits/sec 1084 827 KBytes
> [ 5] 20.00-21.00 sec 5.01 GBytes 43.0 Gbits/sec 558 994 KBytes
> [ 5] 21.00-22.00 sec 5.27 GBytes 45.3 Gbits/sec 450 1.25 MBytes
> [ 5] 22.00-23.00 sec 5.25 GBytes 45.1 Gbits/sec 338 1.18 MBytes
> [ 5] 23.00-24.00 sec 5.29 GBytes 45.4 Gbits/sec 200 1.14 MBytes
> [ 5] 24.00-25.00 sec 5.29 GBytes 45.5 Gbits/sec 518 1.02 MBytes
> [ 5] 25.00-26.00 sec 4.28 GBytes 36.7 Gbits/sec 1258 792 KBytes
> [ 5] 26.00-27.00 sec 3.87 GBytes 33.2 Gbits/sec 1365 799 KBytes
> [ 5] 27.00-28.00 sec 4.77 GBytes 41.0 Gbits/sec 530 1.09 MBytes
> [ 5] 28.00-29.00 sec 5.31 GBytes 45.6 Gbits/sec 419 1.06 MBytes
> [ 5] 29.00-30.00 sec 5.32 GBytes 45.7 Gbits/sec 222 1.10 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate Retr
> [ 5] 0.00-30.00 sec 144 GBytes 41.2 Gbits/sec 20301 sender
> [ 5] 0.00-30.00 sec 144 GBytes 41.2 Gbits/sec receiver
>
> iperf Done.
>
>
> I am still figuring out the technicalities of this patch, but wanted to share these initial findings for your input. Please let me know your thoughts!
Perhaps try: https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
Thanks!
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-11 17:17 ` Eric Dumazet
@ 2025-11-12 5:14 ` Aditya Garg
0 siblings, 0 replies; 37+ messages in thread
From: Aditya Garg @ 2025-11-12 5:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
ssengar, gargaditya
On 11-11-2025 22:47, Eric Dumazet wrote:
> On Tue, Nov 11, 2025 at 8:56 AM Aditya Garg
> <gargaditya@linux.microsoft.com> wrote:
>>
>> On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
>>> [patch changelog, perf profiles, and diff trimmed]
>>
>> I ran these tests on latest net-next for MANA driver and I am observing a regression here.
>>
>> [iperf3 run trimmed: throughput collapses from ~40 Gbit/s to 0 bit/s after ~5 seconds]
>>
>> I tested again by reverting this patch and the regression was not there.
>>
>> [iperf3 run trimmed: steady ~33-46 Gbit/s for the full 30 seconds]
>>
>>
>> I am still figuring out technicalities of this patch, but wanted to share initial findings for your input. Please let me know your thoughts on this!
>
> Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
>
> Thanks !
Thanks Eric, it works fine after the above fix.
Regards,
Aditya
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
` (3 preceding siblings ...)
2025-11-11 16:56 ` Aditya Garg
@ 2025-11-12 14:03 ` Jon Hunter
2025-11-12 14:08 ` Eric Dumazet
2026-03-09 21:00 ` Long delay in freeing skbs Tony Battersby
5 siblings, 1 reply; 37+ messages in thread
From: Jon Hunter @ 2025-11-12 14:03 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, linux-tegra@vger.kernel.org
Hi Eric,
On 06/11/2025 20:29, Eric Dumazet wrote:
> [patch changelog, perf profiles, and diff trimmed]
I have noticed a suspend regression on one of our Tegra boards. Bisecting
points to this commit, and reverting it on top of -next fixes the issue.
Of all the Tegra boards we test, only one is failing: the
tegra124-jetson-tk1. This board uses the Realtek r8169 driver ...
r8169 0000:01:00.0: enabling device (0140 -> 0143)
r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
I don't see any particular crash or error, and even after resuming from
suspend the link does come up ...
r8169 0000:01:00.0 enp1s0: Link is Down
tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
OOM killer enabled.
Restarting tasks: Starting
Restarting tasks: Done
random: crng reseeded on system resumption
PM: suspend exit
ata1: SATA link down (SStatus 0 SControl 300)
r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
However, the board does not seem to resume fully. One thing I should
point out is that for testing we always use an NFS rootfs, so this
indicates that the link comes up but networking still has issues.
Any thoughts?
Jon
--
nvpublic
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-12 14:03 ` Jon Hunter
@ 2025-11-12 14:08 ` Eric Dumazet
2025-11-12 15:26 ` Jon Hunter
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2025-11-12 14:08 UTC (permalink / raw)
To: Jon Hunter
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
linux-tegra@vger.kernel.org
On Wed, Nov 12, 2025 at 6:03 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Eric,
>
> On 06/11/2025 20:29, Eric Dumazet wrote:
> > [patch changelog and perf profiles trimmed]
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > net/core/skbuff.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
> >
> >         DEBUG_NET_WARN_ON_ONCE(!in_softirq());
> >
> > +       if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> > +               skb_release_head_state(skb);
> > +               return skb_attempt_defer_free(skb);
> > +       }
> > +
> >         if (!skb_unref(skb))
> >                 return;
>
> I have noticed a suspend regression on one of our Tegra boards. Bisect
> is pointing to this commit and reverting this on top of -next fixes the
> issue.
>
> Out of all the Tegra boards we test only one is failing and that is the
> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
>
> r8169 0000:01:00.0: enabling device (0140 -> 0143)
> r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
> r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
>
> I don't see any particular crash or error, and even after resuming from
> suspend the link does come up ...
>
> r8169 0000:01:00.0 enp1s0: Link is Down
> tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
> OOM killer enabled.
> Restarting tasks: Starting
> Restarting tasks: Done
> random: crng reseeded on system resumption
> PM: suspend exit
> ata1: SATA link down (SStatus 0 SControl 300)
> r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
>
> However, the board does not seem to resume fully. One thing I should
> point out is that for testing we always use an NFS rootfs. So this
> would indicate that the link comes up but networking is still having
> issues.
>
> Any thoughts?
>
> Jon
Perhaps try: https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
Thanks!
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-12 14:08 ` Eric Dumazet
@ 2025-11-12 15:26 ` Jon Hunter
2025-11-12 15:32 ` Eric Dumazet
0 siblings, 1 reply; 37+ messages in thread
From: Jon Hunter @ 2025-11-12 15:26 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
linux-tegra@vger.kernel.org
On 12/11/2025 14:08, Eric Dumazet wrote:
...
>> I have noticed a suspend regression on one of our Tegra boards. Bisect
>> is pointing to this commit and reverting this on top of -next fixes the
>> issue.
>>
>> Out of all the Tegra boards we test only one is failing and that is the
>> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
>>
>> r8169 0000:01:00.0: enabling device (0140 -> 0143)
>> r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
>> r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
>>
>> I don't see any particular crash or error, and even after resuming from
>> suspend the link does come up ...
>>
>> r8169 0000:01:00.0 enp1s0: Link is Down
>> tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
>> OOM killer enabled.
>> Restarting tasks: Starting
>> Restarting tasks: Done
>> random: crng reseeded on system resumption
>> PM: suspend exit
>> ata1: SATA link down (SStatus 0 SControl 300)
>> r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
>>
>> However, the board does not seem to resume fully. One thing I should
>> point out is that for testing we always use an NFS rootfs. So this
>> would indicate that the link comes up but networking is still having
>> issues.
>>
>> Any thoughts?
>>
>> Jon
>
> Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
That does indeed fix it. Feel free to add my ...
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Thanks
Jon
--
nvpublic
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-12 15:26 ` Jon Hunter
@ 2025-11-12 15:32 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2025-11-12 15:32 UTC (permalink / raw)
To: Jon Hunter
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
linux-tegra@vger.kernel.org
On Wed, Nov 12, 2025 at 7:27 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 12/11/2025 14:08, Eric Dumazet wrote:
>
> ...
>
> >> I have noticed a suspend regression on one of our Tegra boards. Bisect
> >> is pointing to this commit and reverting this on top of -next fixes the
> >> issue.
> >>
> >> Out of all the Tegra boards we test only one is failing and that is the
> >> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
> >>
> >> r8169 0000:01:00.0: enabling device (0140 -> 0143)
> >> r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
> >> r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
> >>
> >> I don't see any particular crash or error, and even after resuming from
> >> suspend the link does come up ...
> >>
> >> r8169 0000:01:00.0 enp1s0: Link is Down
> >> tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
> >> OOM killer enabled.
> >> Restarting tasks: Starting
> >> Restarting tasks: Done
> >> random: crng reseeded on system resumption
> >> PM: suspend exit
> >> ata1: SATA link down (SStatus 0 SControl 300)
> >> r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
> >>
> >> However, the board does not seem to resume fully. One thing I should
> >> point out is that for testing we always use an NFS rootfs. So this
> >> would indicate that the link comes up but networking is still having
> >> issues.
> >>
> >> Any thoughts?
> >>
> >> Jon
> >
> > Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
>
>
> That does indeed fix it. Feel free to add my ...
>
> Tested-by: Jon Hunter <jonathanh@nvidia.com>
Thanks for testing. Note the patch was already merged in net-next, so
we cannot add your tag.
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Nov 11 15:12:35 2025 +0000
net: clear skb->sk in skb_release_head_state()
skb_release_head_state() inlines skb_orphan().
We need to clear skb->sk otherwise we can freeze TCP flows
on a mostly idle host, because skb_fclone_busy() would
return true as long as the packet is not yet processed by
skb_defer_free_flush().
Fixes: 1fcf572211da ("net: allow skb_release_head_state() to be
called multiple times")
Fixes: e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251111151235.1903659-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Long delay in freeing skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
` (4 preceding siblings ...)
2025-11-12 14:03 ` Jon Hunter
@ 2026-03-09 21:00 ` Tony Battersby
2026-03-09 21:07 ` Eric Dumazet
5 siblings, 1 reply; 37+ messages in thread
From: Tony Battersby @ 2026-03-09 21:00 UTC (permalink / raw)
To: Eric Dumazet, netdev
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, eric.dumazet,
David S . Miller, Jakub Kicinski, Paolo Abeni,
Toke Høiland-Jørgensen, Jason Xing
I have some out-of-tree code that was broken by commit e20dfbad8aab
("net: fix napi_consume_skb() with alien skbs") in kernel 6.19. I
retested with 7.0-rc3 with the same results. I'm not sure if the change
in behavior that I am seeing is expected or if it is a bug, so I thought
I would ask. Omitting irrelevant details, my kernel code does the
following:
1) allocate a buffer
2) perform some network I/O using the buffer (cmd over iscsi_tcp)
3) wait for the I/O to complete
4) wait for the buffer page refcounts to drop back to their values for
idle I/O
5) reuse the buffer for more I/O
Step 4 is necessary in this case due to some of those irrelevant details
that I omitted, and that is where it trips up. Commit e20dfbad8aab
seems to keep the page refcounts elevated long after the I/O has
completed (hours or longer), presumably because the deferred skb free
isn't happening for a long time. So:
1) Is this long delay expected, or is this a bug?
2) If the delay is expected, is there a way for me to force the network
system to process the deferred skb frees on request? e.g. write to a
file in /sys, or call a kernel function to flush the deferred frees?
Worst case is that I can redesign my code to kfree() the buffer if it is
busy and allocate a new one, and let the old buffer be freed
asynchronously when the skb is freed at some indeterminate time in the
future.
Thanks,
Tony Battersby
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: Long delay in freeing skbs
2026-03-09 21:00 ` Long delay in freeing skbs Tony Battersby
@ 2026-03-09 21:07 ` Eric Dumazet
2026-03-09 21:18 ` Tony Battersby
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2026-03-09 21:07 UTC (permalink / raw)
To: Tony Battersby
Cc: netdev, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
eric.dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Toke Høiland-Jørgensen, Jason Xing
On Mon, Mar 9, 2026 at 10:00 PM Tony Battersby <tonyb@cybernetics.com> wrote:
>
> I have some out-of-tree code that was broken by commit e20dfbad8aab
> ("net: fix napi_consume_skb() with alien skbs") in kernel 6.19. I
> retested with 7.0-rc3 with the same results. I'm not sure if the change
> in behavior that I am seeing is expected or if it is a bug, so I thought
> I would ask. Omitting irrelevant details, my kernel code does the
> following:
>
> 1) allocate a buffer
> 2) perform some network I/O using the buffer (cmd over iscsi_tcp)
> 3) wait for the I/O to complete
> 4) wait for the buffer page refcounts to drop back to their values for
> idle I/O
> 5) reuse the buffer for more I/O
>
> Step 4 is necessary in this case due to some of those irrelevant details
> that I omitted, and that is where it trips up. Commit e20dfbad8aab
> seems to keep the page refcounts elevated long after the I/O has
> completed (hours or longer), presumably because the deferred skb free
> isn't happening for a long time. So:
>
> 1) Is this long delay expected, or is this a bug?
>
> 2) If the delay is expected, is there a way for me to force the network
> system to process the deferred skb frees on request? e.g. write to a
> file in /sys, or call a kernel function to flush the deferred frees?
>
> Worst case is that I can redesign my code to kfree() the buffer if it is
> busy and allocate a new one, and let the old buffer be freed
> asynchronously when the skb is freed at some indeterminate time in the
> future.
>
> Thanks,
> Tony Battersby
If you are using zero copy, please look at the commit below.
Maybe you should make sure your code is compatible with it.
commit 0943404b1f3b178e1e54386dadcbf4f2729c7762
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Feb 16 19:36:53 2026 +0000
net: do not delay zero-copy skbs in skb_attempt_defer_free()
After the blamed commit, TCP tx zero copy notifications could be
arbitrarily delayed and cause regressions in applications waiting
for them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs")
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260216193653.627617-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 699c401a5eae9c497a42b6bdd8593af7890529f4..dc47d3efc72ed86dce5e382d505eda7bc863669a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7266,10 +7266,15 @@ void skb_attempt_defer_free(struct sk_buff *skb)
 {
        struct skb_defer_node *sdn;
        unsigned long defer_count;
-       int cpu = skb->alloc_cpu;
        unsigned int defer_max;
        bool kick;
+       int cpu;

+       /* zero copy notifications should not be delayed. */
+       if (skb_zcopy(skb))
+               goto nodefer;
+
+       cpu = skb->alloc_cpu;
        if (cpu == raw_smp_processor_id() ||
            WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
            !cpu_online(cpu)) {
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: Long delay in freeing skbs
2026-03-09 21:07 ` Eric Dumazet
@ 2026-03-09 21:18 ` Tony Battersby
2026-03-09 21:24 ` Eric Dumazet
0 siblings, 1 reply; 37+ messages in thread
From: Tony Battersby @ 2026-03-09 21:18 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
eric.dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Toke Høiland-Jørgensen, Jason Xing
On 3/9/26 17:07, Eric Dumazet wrote:
> On Mon, Mar 9, 2026 at 10:00 PM Tony Battersby <tonyb@cybernetics.com> wrote:
>> I have some out-of-tree code that was broken by commit e20dfbad8aab
>> ("net: fix napi_consume_skb() with alien skbs") in kernel 6.19. I
>> retested with 7.0-rc3 with the same results. I'm not sure if the change
>> in behavior that I am seeing is expected or if it is a bug, so I thought
>> I would ask. Omitting irrelevant details, my kernel code does the
>> following:
>>
>> 1) allocate a buffer
>> 2) perform some network I/O using the buffer (cmd over iscsi_tcp)
>> 3) wait for the I/O to complete
>> 4) wait for the buffer page refcounts to drop back to their values for
>> idle I/O
>> 5) reuse the buffer for more I/O
>>
>> Step 4 is necessary in this case due to some of those irrelevant details
>> that I omitted, and that is where it trips up. Commit e20dfbad8aab
>> seems to keep the page refcounts elevated long after the I/O has
>> completed (hours or longer), presumably because the deferred skb free
>> isn't happening for a long time. So:
>>
>> 1) Is this long delay expected, or is this a bug?
>>
>> 2) If the delay is expected, is there a way for me to force the network
>> system to process the deferred skb frees on request? e.g. write to a
>> file in /sys, or call a kernel function to flush the deferred frees?
>>
>> Worst case is that I can redesign my code to kfree() the buffer if it is
>> busy and allocate a new one, and let the old buffer be freed
>> asynchronously when the skb is freed at some indeterminate time in the
>> future.
>>
>> Thanks,
>> Tony Battersby
>
> If you are using zero copy, please look at:
>
> Maybe you should make sure your code is compatible with it.
>
> commit 0943404b1f3b178e1e54386dadcbf4f2729c7762
> Author: Eric Dumazet <edumazet@google.com>
> Date: Mon Feb 16 19:36:53 2026 +0000
>
> net: do not delay zero-copy skbs in skb_attempt_defer_free()
>
> After the blamed commit, TCP tx zero copy notifications could be
> arbitrarily delayed and cause regressions in applications waiting
> for them.
>
The I/O is being sent by a userspace program via the SCSI generic driver
(drivers/scsi/sg.c) over iSCSI (drivers/scsi/iscsi_tcp.c). I don't see
any mention of zero copy in iscsi_tcp.c.
Tony
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: Long delay in freeing skbs
2026-03-09 21:18 ` Tony Battersby
@ 2026-03-09 21:24 ` Eric Dumazet
2026-03-09 21:47 ` Tony Battersby
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2026-03-09 21:24 UTC (permalink / raw)
To: Tony Battersby
Cc: netdev, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
eric.dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Toke Høiland-Jørgensen, Jason Xing
On Mon, Mar 9, 2026 at 10:18 PM Tony Battersby <tonyb@cybernetics.com> wrote:
>
> On 3/9/26 17:07, Eric Dumazet wrote:
> > On Mon, Mar 9, 2026 at 10:00 PM Tony Battersby <tonyb@cybernetics.com> wrote:
> >> I have some out-of-tree code that was broken by commit e20dfbad8aab
> >> ("net: fix napi_consume_skb() with alien skbs") in kernel 6.19. I
> >> retested with 7.0-rc3 with the same results. I'm not sure if the change
> >> in behavior that I am seeing is expected or if it is a bug, so I thought
> >> I would ask. Omitting irrelevant details, my kernel code does the
> >> following:
> >>
> >> 1) allocate a buffer
> >> 2) perform some network I/O using the buffer (cmd over iscsi_tcp)
> >> 3) wait for the I/O to complete
> >> 4) wait for the buffer page refcounts to drop back to their values for
> >> idle I/O
> >> 5) reuse the buffer for more I/O
> >>
> >> Step 4 is necessary in this case due to some of those irrelevant details
> >> that I omitted, and that is where it trips up. Commit e20dfbad8aab
> >> seems to keep the page refcounts elevated long after the I/O has
> >> completed (hours or longer), presumably because the deferred skb free
> >> isn't happening for a long time. So:
> >>
> >> 1) Is this long delay expected, or is this a bug?
> >>
> >> 2) If the delay is expected, is there a way for me to force the network
> >> system to process the deferred skb frees on request? e.g. write to a
> >> file in /sys, or call a kernel function to flush the deferred frees?
> >>
> >> Worst case is that I can redesign my code to kfree() the buffer if it is
> >> busy and allocate a new one, and let the old buffer be freed
> >> asynchronously when the skb is freed at some indeterminate time in the
> >> future.
> >>
> >> Thanks,
> >> Tony Battersby
> >
> > If you are using zero copy, please look at:
> >
> > Maybe you should make sure your code is compatible with it.
> >
> > commit 0943404b1f3b178e1e54386dadcbf4f2729c7762
> > Author: Eric Dumazet <edumazet@google.com>
> > Date: Mon Feb 16 19:36:53 2026 +0000
> >
> > net: do not delay zero-copy skbs in skb_attempt_defer_free()
> >
> > After the blamed commit, TCP tx zero copy notifications could be
> > arbitrarily delayed and cause regressions in applications waiting
> > for them.
> >
> The I/O is being sent by a userspace program via the SCSI generic driver
> (drivers/scsi/sg.c) over iSCSI (drivers/scsi/iscsi_tcp.c). I don't see
> any mention of zero copy in iscsi_tcp.c.
Set /proc/sys/net/core/skb_defer_max to 0 ?
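[Editor's sketch, not part of the original mail: how that sysctl can be
inspected and cleared. The path is as given above; the default of 64 and
the comments are assumptions about mainline behavior.]

```shell
# Show the current per-cpu limit on deferred skb frees
# (mainline default is 64; 0 disables deferral entirely).
cat /proc/sys/net/core/skb_defer_max

# Disable deferred freeing so skbs are freed inline on the
# completing cpu, trading some TX-completion throughput for
# prompt release of page refcounts.
echo 0 > /proc/sys/net/core/skb_defer_max
# equivalently:
sysctl -w net.core.skb_defer_max=0
```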
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: Long delay in freeing skbs
2026-03-09 21:24 ` Eric Dumazet
@ 2026-03-09 21:47 ` Tony Battersby
2026-03-10 14:49 ` Tony Battersby
0 siblings, 1 reply; 37+ messages in thread
From: Tony Battersby @ 2026-03-09 21:47 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
eric.dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Toke Høiland-Jørgensen, Jason Xing
On 3/9/26 17:24, Eric Dumazet wrote:
> On Mon, Mar 9, 2026 at 10:18 PM Tony Battersby <tonyb@cybernetics.com> wrote:
>> On 3/9/26 17:07, Eric Dumazet wrote:
>>> On Mon, Mar 9, 2026 at 10:00 PM Tony Battersby <tonyb@cybernetics.com> wrote:
>>>> I have some out-of-tree code that was broken by commit e20dfbad8aab
>>>> ("net: fix napi_consume_skb() with alien skbs") in kernel 6.19. I
>>>> retested with 7.0-rc3 with the same results. I'm not sure if the change
>>>> in behavior that I am seeing is expected or if it is a bug, so I thought
>>>> I would ask. Omitting irrelevant details, my kernel code does the
>>>> following:
>>>>
>>>> 1) allocate a buffer
>>>> 2) perform some network I/O using the buffer (cmd over iscsi_tcp)
>>>> 3) wait for the I/O to complete
>>>> 4) wait for the buffer page refcounts to drop back to their values for
>>>> idle I/O
>>>> 5) reuse the buffer for more I/O
>>>>
>>>> Step 4 is necessary in this case due to some of those irrelevant details
>>>> that I omitted, and that is where it trips up. Commit e20dfbad8aab
>>>> seems to keep the page refcounts elevated long after the I/O has
>>>> completed (hours or longer), presumably because the deferred skb free
>>>> isn't happening for a long time. So:
>>>>
>>>> 1) Is this long delay expected, or is this a bug?
>>>>
>>>> 2) If the delay is expected, is there a way for me to force the network
>>>> system to process the deferred skb frees on request? e.g. write to a
>>>> file in /sys, or call a kernel function to flush the deferred frees?
>>>>
>>>> Worst case is that I can redesign my code to kfree() the buffer if it is
>>>> busy and allocate a new one, and let the old buffer be freed
>>>> asynchronously when the skb is freed at some indeterminate time in the
>>>> future.
>>>>
>>>> Thanks,
>>>> Tony Battersby
>>> If you are using zero copy, please look at:
>>>
>>> Maybe you should make sure your code is compatible with it.
>>>
>>> commit 0943404b1f3b178e1e54386dadcbf4f2729c7762
>>> Author: Eric Dumazet <edumazet@google.com>
>>> Date: Mon Feb 16 19:36:53 2026 +0000
>>>
>>> net: do not delay zero-copy skbs in skb_attempt_defer_free()
>>>
>>> After the blamed commit, TCP tx zero copy notifications could be
>>> arbitrarily delayed and cause regressions in applications waiting
>>> for them.
>>>
>> The I/O is being sent by a userspace program via the SCSI generic driver
>> (drivers/scsi/sg.c) over iSCSI (drivers/scsi/iscsi_tcp.c). I don't see
>> any mention of zero copy in iscsi_tcp.c.
> Set /proc/sys/net/core/skb_defer_max to 0 ?
Thanks; I should have grepped the docs harder.
That seems to work if I do it before sending the I/O but not after the
I/O has completed. So I presume it would decrease performance?
Tony
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: Long delay in freeing skbs
2026-03-09 21:47 ` Tony Battersby
@ 2026-03-10 14:49 ` Tony Battersby
2026-03-10 15:25 ` Eric Dumazet
0 siblings, 1 reply; 37+ messages in thread
From: Tony Battersby @ 2026-03-10 14:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
eric.dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Toke Høiland-Jørgensen, Jason Xing
On 3/9/26 17:47, Tony Battersby wrote:
> On 3/9/26 17:24, Eric Dumazet wrote:
>> On Mon, Mar 9, 2026 at 10:18 PM Tony Battersby <tonyb@cybernetics.com> wrote:
>>> On 3/9/26 17:07, Eric Dumazet wrote:
>>>> On Mon, Mar 9, 2026 at 10:00 PM Tony Battersby <tonyb@cybernetics.com> wrote:
>>>>> I have some out-of-tree code that was broken by commit e20dfbad8aab
>>>>> ("net: fix napi_consume_skb() with alien skbs") in kernel 6.19. I
>>>>> retested with 7.0-rc3 with the same results. I'm not sure if the change
>>>>> in behavior that I am seeing is expected or if it is a bug, so I thought
>>>>> I would ask. Omitting irrelevant details, my kernel code does the
>>>>> following:
>>>>>
>>>>> 1) allocate a buffer
>>>>> 2) perform some network I/O using the buffer (cmd over iscsi_tcp)
>>>>> 3) wait for the I/O to complete
>>>>> 4) wait for the buffer page refcounts to drop back to their values for
>>>>> idle I/O
>>>>> 5) reuse the buffer for more I/O
>>>>>
>>>>> Step 4 is necessary in this case due to some of those irrelevant details
>>>>> that I omitted, and that is where it trips up. Commit e20dfbad8aab
>>>>> seems to keep the page refcounts elevated long after the I/O has
>>>>> completed (hours or longer), presumably because the deferred skb free
>>>>> isn't happening for a long time. So:
>>>>>
>>>>> 1) Is this long delay expected, or is this a bug?
>>>>>
>>>>> 2) If the delay is expected, is there a way for me to force the network
>>>>> system to process the deferred skb frees on request? e.g. write to a
>>>>> file in /sys, or call a kernel function to flush the deferred frees?
>>>>>
>>>>> Worst case is that I can redesign my code to kfree() the buffer if it is
>>>>> busy and allocate a new one, and let the old buffer be freed
>>>>> asynchronously when the skb is freed at some indeterminate time in the
>>>>> future.
>>>>>
>>>>> Thanks,
>>>>> Tony Battersby
>>>> If you are using zero copy, please look at:
>>>>
>>>> Maybe you should make sure your code is compatible with it.
>>>>
>>>> commit 0943404b1f3b178e1e54386dadcbf4f2729c7762
>>>> Author: Eric Dumazet <edumazet@google.com>
>>>> Date: Mon Feb 16 19:36:53 2026 +0000
>>>>
>>>> net: do not delay zero-copy skbs in skb_attempt_defer_free()
>>>>
>>>> After the blamed commit, TCP tx zero copy notifications could be
>>>> arbitrarily delayed and cause regressions in applications waiting
>>>> for them.
>>>>
>>> The I/O is being sent by a userspace program via the SCSI generic driver
>>> (drivers/scsi/sg.c) over iSCSI (drivers/scsi/iscsi_tcp.c). I don't see
>>> any mention of zero copy in iscsi_tcp.c.
>> Set /proc/sys/net/core/skb_defer_max to 0 ?
> Thanks; I should have grepped the docs harder.
>
> That seems to work if I do it before sending the I/O but not after the
> I/O has completed. So I presume it would decrease performance?
>
> Tony
>
I found a workable solution. When my code finds an elevated page
refcount, it does the following:
for_each_online_cpu(cpu) {
        kick_defer_list_purge(cpu);
}
And that seems to take care of it, and it pushes the extra overhead to
the slow path.
Thanks for your help!
Tony Battersby
^ permalink raw reply [flat|nested] 37+ messages in thread* Re: Long delay in freeing skbs
2026-03-10 14:49 ` Tony Battersby
@ 2026-03-10 15:25 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2026-03-10 15:25 UTC (permalink / raw)
To: Tony Battersby
Cc: netdev, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
eric.dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Toke Høiland-Jørgensen, Jason Xing
On Tue, Mar 10, 2026 at 3:49 PM Tony Battersby <tonyb@cybernetics.com> wrote:
>
> On 3/9/26 17:47, Tony Battersby wrote:
> > On 3/9/26 17:24, Eric Dumazet wrote:
> >> On Mon, Mar 9, 2026 at 10:18 PM Tony Battersby <tonyb@cybernetics.com> wrote:
> >>> On 3/9/26 17:07, Eric Dumazet wrote:
> >>>> On Mon, Mar 9, 2026 at 10:00 PM Tony Battersby <tonyb@cybernetics.com> wrote:
> >>>>> I have some out-of-tree code that was broken by commit e20dfbad8aab
> >>>>> ("net: fix napi_consume_skb() with alien skbs") in kernel 6.19. I
> >>>>> retested with 7.0-rc3 with the same results. I'm not sure if the change
> >>>>> in behavior that I am seeing is expected or if it is a bug, so I thought
> >>>>> I would ask. Omitting irrelevant details, my kernel code does the
> >>>>> following:
> >>>>>
> >>>>> 1) allocate a buffer
> >>>>> 2) perform some network I/O using the buffer (cmd over iscsi_tcp)
> >>>>> 3) wait for the I/O to complete
> >>>>> 4) wait for the buffer page refcounts to drop back to their values for
> >>>>> idle I/O
> >>>>> 5) reuse the buffer for more I/O
> >>>>>
> >>>>> Step 4 is necessary in this case due to some of those irrelevant details
> >>>>> that I omitted, and that is where it trips up. Commit e20dfbad8aab
> >>>>> seems to keep the page refcounts elevated long after the I/O has
> >>>>> completed (hours or longer), presumably because the deferred skb free
> >>>>> isn't happening for a long time. So:
> >>>>>
> >>>>> 1) Is this long delay expected, or is this a bug?
> >>>>>
> >>>>> 2) If the delay is expected, is there a way for me to force the network
> >>>>> system to process the deferred skb frees on request? e.g. write to a
> >>>>> file in /sys, or call a kernel function to flush the deferred frees?
> >>>>>
> >>>>> Worst case is that I can redesign my code to kfree() the buffer if it is
> >>>>> busy and allocate a new one, and let the old buffer be freed
> >>>>> asynchronously when the skb is freed at some indeterminate time in the
> >>>>> future.
> >>>>>
> >>>>> Thanks,
> >>>>> Tony Battersby
> >>>> If you are using zero copy, please look at:
> >>>>
> >>>> Maybe you should make sure your code is compatible with it.
> >>>>
> >>>> commit 0943404b1f3b178e1e54386dadcbf4f2729c7762
> >>>> Author: Eric Dumazet <edumazet@google.com>
> >>>> Date: Mon Feb 16 19:36:53 2026 +0000
> >>>>
> >>>> net: do not delay zero-copy skbs in skb_attempt_defer_free()
> >>>>
> >>>> After the blamed commit, TCP tx zero copy notifications could be
> >>>> arbitrarily delayed and cause regressions in applications waiting
> >>>> for them.
> >>>>
> >>> The I/O is being sent by a userspace program via the SCSI generic driver
> >>> (drivers/scsi/sg.c) over iSCSI (drivers/scsi/iscsi_tcp.c). I don't see
> >>> any mention of zero copy in iscsi_tcp.c.
> >> Set /proc/sys/net/core/skb_defer_max to 0 ?
> > Thanks; I should have grepped the docs harder.
> >
> > That seems to work if I do it before sending the I/O but not after the
> > I/O has completed. So I presume it would decrease performance?
> >
> > Tony
> >
> I found a workable solution. When my code finds an elevated page
> refcount, it does the following:
>
> for_each_online_cpu(cpu) {
>         kick_defer_list_purge(cpu);
> }
>
> And that seems to take care of it, and it pushes the extra overhead to
> the slow path.
>
> Thanks for your help!
You could probably send packets over the loopback interface; this should
trigger net_rx_action() from one cpu only, which might be less disruptive,
but maybe a bit more complicated to get right.
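[Editor's sketch, not code from this thread: a minimal userspace
illustration of the loopback idea. The helper name and the choice of
UDP port 9 (discard) are arbitrary assumptions.]

```python
import socket

def nudge_loopback(port=9):
    """Send one UDP datagram to 127.0.0.1 so the local stack runs
    net_rx_action() on the receiving cpu. Port 9 (discard) is an
    arbitrary choice; no listener is required for a UDP sendto().
    Returns the number of bytes sent, or -1 if loopback is unusable."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            return s.sendto(b"\x00", ("127.0.0.1", port))
    except OSError:
        return -1
```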
Another way is to setup RFS on your NIC.
Documentation/networking/scaling.rst
Thanks.
^ permalink raw reply [flat|nested] 37+ messages in thread