* [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb()
@ 2025-11-06 20:29 Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
` (3 more replies)
0 siblings, 4 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-06 20:29 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
There is a lack of NUMA awareness and, more generally, a lack
of slab cache affinity on the TX completion path.
Modern drivers use napi_consume_skb(), hoping to cache sk_buff
in per-cpu caches so that they can be recycled in the RX path.
Only use these per-cpu caches if the skb was allocated on the same cpu;
otherwise use skb_attempt_defer_free() so that the skb
is freed on its original cpu.
This removes contention on SLUB spinlocks and data structures,
and makes sure that recycled sk_buffs have correct NUMA locality.
After this series, I get a ~50% improvement for a UDP tx workload
on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
I will later refactor skb_attempt_defer_free()
so that it no longer has to care about skb_shared() and skb_release_head_state().
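The dispatch that this series introduces can be modeled as a small, self-contained C sketch. All names below are hypothetical stand-ins for the kernel structures (toy_skb for sk_buff, pick_free_path for the branch added to napi_consume_skb()); this is a model of the idea, not the kernel code:

```c
#include <assert.h>

/* Toy model of the napi_consume_skb() dispatch: recycle an skb through
 * the local per-cpu cache only when it was allocated on this cpu;
 * otherwise defer the free back to the allocating cpu, as
 * skb_attempt_defer_free() does.  Shared skbs (refcount > 1) must keep
 * going through the regular unref path. */
enum free_path { RECYCLE_LOCAL, DEFER_TO_ALLOC_CPU };

struct toy_skb {
	int alloc_cpu;	/* cpu that allocated this skb */
	int users;	/* simplified refcount; > 1 models skb_shared() */
};

static enum free_path pick_free_path(const struct toy_skb *skb, int this_cpu)
{
	if (skb->alloc_cpu != this_cpu && skb->users == 1)
		return DEFER_TO_ALLOC_CPU;
	return RECYCLE_LOCAL;
}
```

In this model, an skb allocated on cpu 0 but completed on cpu 5 is deferred back to cpu 0, which is what removes the cross-cpu SLUB traffic described above.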
Eric Dumazet (3):
net: allow skb_release_head_state() to be called multiple times
net: fix napi_consume_skb() with alien skbs
net: increase skb_defer_max default to 128
Documentation/admin-guide/sysctl/net.rst | 4 ++--
net/core/hotdata.c | 2 +-
net/core/skbuff.c | 12 ++++++++----
3 files changed, 11 insertions(+), 7 deletions(-)
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times
2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
@ 2025-11-06 20:29 ` Eric Dumazet
2025-11-07 6:42 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
` (2 subsequent siblings)
3 siblings, 2 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-06 20:29 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
Currently, only the skb dst is cleared (thanks to skb_dst_drop()).
Make sure skb->destructor, conntrack state and extensions are also cleared.
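Why clearing each field makes repeated calls safe can be illustrated with a stand-alone toy sketch (hypothetical names; a model of the idempotency property, not the kernel code): once every resource pointer is reset as it is released, a second call finds only NULLs and is a harmless no-op.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an idempotent release function: each resource pointer is
 * cleared as it is dropped, so calling the function twice only drops
 * each reference once. */
struct toy_head_state {
	void *dst;
	void *conntrack;
	void *extensions;
	int drops;	/* counts how many resources were actually dropped */
};

static void toy_release_head_state(struct toy_head_state *s)
{
	if (s->dst)        { s->dst = NULL;        s->drops++; }
	if (s->conntrack)  { s->conntrack = NULL;  s->drops++; }
	if (s->extensions) { s->extensions = NULL; s->drops++; }
}
```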
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/skbuff.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b4bc8b1c7d5674c19b64f8b15685d74632048fe..eeddb9e737ff28e47c77739db7b25ea68e5aa735 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1149,11 +1149,10 @@ void skb_release_head_state(struct sk_buff *skb)
skb);
#endif
+ skb->destructor = NULL;
}
-#if IS_ENABLED(CONFIG_NF_CONNTRACK)
- nf_conntrack_put(skb_nfct(skb));
-#endif
- skb_ext_put(skb);
+ nf_reset_ct(skb);
+ skb_ext_reset(skb);
}
/* Free everything but the sk_buff shell. */
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
@ 2025-11-06 20:29 ` Eric Dumazet
2025-11-07 6:46 ` Kuniyuki Iwashima
` (4 more replies)
2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
2025-11-08 3:10 ` [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() patchwork-bot+netdevbpf
3 siblings, 5 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-06 20:29 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
There is a lack of NUMA awareness and, more generally, a lack
of slab cache affinity on the TX completion path.
Modern drivers use napi_consume_skb(), hoping to cache sk_buff
in per-cpu caches so that they can be recycled in the RX path.
Only use these per-cpu caches if the skb was allocated on the same cpu;
otherwise use skb_attempt_defer_free() so that the skb
is freed on its original cpu.
This removes contention on SLUB spinlocks and data structures.
After this patch, I get a ~50% improvement for a UDP tx workload
on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
80 Mpps -> 120 Mpps.
Profiling one of the 32 cpus servicing NIC interrupts:
Before:
mpstat -P 511 1 1
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
2.11% swapper [kernel.kallsyms] [k] __slab_free
2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
1.03% swapper [kernel.kallsyms] [k] fq_dequeue
0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
0.93% swapper [kernel.kallsyms] [k] read_tsc
0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
0.76% swapper [kernel.kallsyms] [k] idpf_features_check
0.72% swapper [kernel.kallsyms] [k] skb_release_data
0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
0.48% swapper [kernel.kallsyms] [k] sock_wfree
After:
mpstat -P 511 1 1
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
6.69% swapper [kernel.kallsyms] [k] sock_wfree
5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
3.10% swapper [kernel.kallsyms] [k] fq_dequeue
3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
2.73% swapper [kernel.kallsyms] [k] read_tsc
2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
1.20% swapper [kernel.kallsyms] [k] idpf_features_check
1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
0.53% swapper [kernel.kallsyms] [k] io_idle
0.43% swapper [kernel.kallsyms] [k] netif_skb_features
0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
0.34% swapper [kernel.kallsyms] [k] handle_softirqs
0.32% swapper [kernel.kallsyms] [k] net_rx_action
0.32% swapper [kernel.kallsyms] [k] dql_completed
0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
0.28% swapper [kernel.kallsyms] [k] ktime_get
0.24% swapper [kernel.kallsyms] [k] __qdisc_run
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/skbuff.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+ if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
+ skb_release_head_state(skb);
+ return skb_attempt_defer_free(skb);
+ }
+
if (!skb_unref(skb))
return;
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
@ 2025-11-06 20:29 ` Eric Dumazet
2025-11-07 6:47 ` Kuniyuki Iwashima
` (2 more replies)
2025-11-08 3:10 ` [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() patchwork-bot+netdevbpf
3 siblings, 3 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-06 20:29 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
The skb_defer_max default value is very conservative; it can be
increased to avoid too many calls to kick_defer_list_purge().
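For reference, the knob is exposed as a sysctl (under the usual net.core namespace documented in admin-guide/sysctl/net.rst), so the old value can be restored at runtime if needed:

```shell
# Read the current per-cpu defer list cap
sysctl net.core.skb_defer_max
# Restore the previous default of 64, if desired
sysctl -w net.core.skb_defer_max=64
```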
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
Documentation/admin-guide/sysctl/net.rst | 4 ++--
net/core/hotdata.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 991773dcb9cfe57f64bffabc018549b712aed9b0..369a738a68193e897d880eeb2c5a22cd90833938 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -355,9 +355,9 @@ skb_defer_max
-------------
Max size (in skbs) of the per-cpu list of skbs being freed
-by the cpu which allocated them. Used by TCP stack so far.
+by the cpu which allocated them.
-Default: 64
+Default: 128
optmem_max
----------
diff --git a/net/core/hotdata.c b/net/core/hotdata.c
index 95d0a4df10069e4529fb9e5b58e8391574085cf1..dddd5c287cf08ba75aec1cc546fd1bc48c0f7b26 100644
--- a/net/core/hotdata.c
+++ b/net/core/hotdata.c
@@ -20,7 +20,7 @@ struct net_hotdata net_hotdata __cacheline_aligned = {
.dev_tx_weight = 64,
.dev_rx_weight = 64,
.sysctl_max_skb_frags = MAX_SKB_FRAGS,
- .sysctl_skb_defer_max = 64,
+ .sysctl_skb_defer_max = 128,
.sysctl_mem_pcpu_rsv = SK_MEMORY_PCPU_RESERVE
};
EXPORT_SYMBOL(net_hotdata);
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times
2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
@ 2025-11-07 6:42 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
1 sibling, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2025-11-07 6:42 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, netdev, eric.dumazet
On Thu, Nov 6, 2025 at 12:29 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Currently, only the skb dst is cleared (thanks to skb_dst_drop()).
>
> Make sure skb->destructor, conntrack state and extensions are also cleared.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
@ 2025-11-07 6:46 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
` (3 subsequent siblings)
4 siblings, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2025-11-07 6:46 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, netdev, eric.dumazet
On Thu, Nov 6, 2025 at 12:29 PM Eric Dumazet <edumazet@google.com> wrote:
>
> There is a lack of NUMA awareness and, more generally, a lack
> of slab cache affinity on the TX completion path.
>
> Modern drivers use napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in the RX path.
>
> Only use these per-cpu caches if the skb was allocated on the same cpu;
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on its original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get a ~50% improvement for a UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts:
>
> [...]
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Big improvement again!
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
@ 2025-11-07 6:47 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
2025-11-07 15:37 ` Jason Xing
2 siblings, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2025-11-07 6:47 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, netdev, eric.dumazet
On Thu, Nov 6, 2025 at 12:29 PM Eric Dumazet <edumazet@google.com> wrote:
>
> The skb_defer_max default value is very conservative; it can be
> increased to avoid too many calls to kick_defer_list_purge().
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times
2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
2025-11-07 6:42 ` Kuniyuki Iwashima
@ 2025-11-07 11:23 ` Toke Høiland-Jørgensen
1 sibling, 0 replies; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 11:23 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> Currently, only the skb dst is cleared (thanks to skb_dst_drop()).
>
> Make sure skb->destructor, conntrack state and extensions are also cleared.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
2025-11-07 6:46 ` Kuniyuki Iwashima
@ 2025-11-07 11:23 ` Toke Høiland-Jørgensen
2025-11-07 12:28 ` Eric Dumazet
2025-11-07 15:44 ` Jason Xing
` (2 subsequent siblings)
4 siblings, 1 reply; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 11:23 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> There is a lack of NUMA awareness and, more generally, a lack
> of slab cache affinity on the TX completion path.
>
> Modern drivers use napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in the RX path.
>
> Only use these per-cpu caches if the skb was allocated on the same cpu;
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on its original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get a ~50% improvement for a UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts:
>
> [...]
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Impressive!
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
2025-11-07 6:47 ` Kuniyuki Iwashima
@ 2025-11-07 11:23 ` Toke Høiland-Jørgensen
2025-11-07 15:37 ` Jason Xing
2 siblings, 0 replies; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 11:23 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> The skb_defer_max default value is very conservative; it can be
> increased to avoid too many calls to kick_defer_list_purge().
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-07 11:23 ` Toke Høiland-Jørgensen
@ 2025-11-07 12:28 ` Eric Dumazet
2025-11-07 14:34 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 12:28 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 3:23 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> Impressive!
>
> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Thanks!
Note that my upcoming plan is also to plumb skb_attempt_defer_free()
into __kfree_skb().
[ Part of a refactor, putting skb_unref() in skb_attempt_defer_free() ]
TCP under pressure would benefit from this a _lot_.
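The deferral mechanism being extended here can be modeled as a toy per-cpu deferred list (hypothetical names; a sketch of the idea, not the kernel implementation, where skb_attempt_defer_free() uses a lockless list and kick_defer_list_purge() sends the IPI): producers append to the owning cpu's list, kick the owner only on the empty-to-non-empty transition, and fall back to a direct remote free once the list is full.

```c
#include <assert.h>

/* Toy model of per-cpu deferred freeing: each cpu owns a deferred list
 * filled by remote cpus.  A producer "kicks" the owner (an IPI in the
 * kernel, a counter here) only when the list goes from empty to
 * non-empty, so most deferrals are IPI-free; once the list reaches the
 * cap (skb_defer_max in the kernel), the producer frees remotely
 * itself.  A larger cap therefore means fewer purge cycles and kicks. */
#define TOY_DEFER_MAX 128

struct toy_softnet {
	int defer_count;	/* skbs queued for this cpu to free */
	int kicks;		/* how many wake-up IPIs were sent */
};

/* Returns 1 if the skb was queued for deferred freeing, 0 if the list
 * is full and the caller must free it remotely. */
static int toy_defer_free(struct toy_softnet *owner)
{
	if (owner->defer_count >= TOY_DEFER_MAX)
		return 0;
	/* Kick only when the list transitions from empty. */
	if (owner->defer_count++ == 0)
		owner->kicks++;
	return 1;
}
```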
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-07 12:28 ` Eric Dumazet
@ 2025-11-07 14:34 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 14:34 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
Eric Dumazet <edumazet@google.com> writes:
> On Fri, Nov 7, 2025 at 3:23 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
>> Impressive!
>>
>> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
>
> Thanks !
>
> Note that my upcoming plan is also to plumb skb_attempt_defer_free()
> into __kfree_skb().
>
> [ Part of a refactor, puting skb_unref() in skb_attempt_defer_free() ]
>
> TCP under pressure would benefit from this a _lot_.
Interesting; look forward to seeing the results :)
-Toke
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
2025-11-07 6:47 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
@ 2025-11-07 15:37 ` Jason Xing
2025-11-07 15:47 ` Eric Dumazet
2 siblings, 1 reply; 30+ messages in thread
From: Jason Xing @ 2025-11-07 15:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
>
> The skb_defer_max default value is very conservative; it can be
> increased to avoid too many calls to kick_defer_list_purge().
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
I was wondering whether we ought to enlarge NAPI_SKB_CACHE_SIZE to 128
as well, since the skb freeing happens in softirq context. This came up
while I was doing an optimization for af_xdp, which also defers freeing
skbs to obtain some improvement in performance. I'd like to know your
opinion on this; thanks in advance!
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Thanks!
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
2025-11-07 6:46 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
@ 2025-11-07 15:44 ` Jason Xing
2025-11-11 16:56 ` Aditya Garg
2025-11-12 14:03 ` Jon Hunter
4 siblings, 0 replies; 30+ messages in thread
From: Jason Xing @ 2025-11-07 15:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
>
> There is a lack of NUMA awareness and, more generally, a lack
> of slab cache affinity on the TX completion path.
>
> Modern drivers use napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in the RX path.
>
> Only use these per-cpu caches if the skb was allocated on the same cpu;
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on its original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get a ~50% improvement for a UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts:
>
> [...]
> 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> 0.32% swapper [kernel.kallsyms] [k] dql_completed
> 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> 0.28% swapper [kernel.kallsyms] [k] ktime_get
> 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Thanks for your brilliant work, which once again gives me so much
inspiration.
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Thanks,
Jason
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-07 15:37 ` Jason Xing
@ 2025-11-07 15:47 ` Eric Dumazet
2025-11-07 15:49 ` Jason Xing
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 15:47 UTC (permalink / raw)
To: Jason Xing
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > skb_defer_max value is very conservative, and can be increased
> > to avoid too many calls to kick_defer_list_purge().
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
I was wondering whether we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
well, since the skb freeing happens in softirq context; the idea came up
while I was working on the af_xdp optimization. That cache is also used to
defer freeing skbs and obtain some performance improvement. I'd like to know
your opinion on this, thanks in advance!
Makes sense. I even had a patch like this in my queue ;)
>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>
> Thanks!
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-07 15:47 ` Eric Dumazet
@ 2025-11-07 15:49 ` Jason Xing
2025-11-07 16:00 ` Eric Dumazet
0 siblings, 1 reply; 30+ messages in thread
From: Jason Xing @ 2025-11-07 15:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > skb_defer_max value is very conservative, and can be increased
> > > to avoid too many calls to kick_defer_list_purge().
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > well since the freeing skb happens in the softirq context, which I
> > came up with when I was doing the optimization for af_xdp. That is
> > also used to defer freeing skb to obtain some improvement in
> > performance. I'd like to know your opinion on this, thanks in advance!
>
> Makes sense. I even had a patch like this in my queue ;)
Great to hear that. Look forward to seeing it soon :)
Thanks,
Jason
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-07 15:49 ` Jason Xing
@ 2025-11-07 16:00 ` Eric Dumazet
2025-11-07 16:03 ` Jason Xing
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 16:00 UTC (permalink / raw)
To: Jason Xing
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > skb_defer_max value is very conservative, and can be increased
> > > > to avoid too many calls to kick_defer_list_purge().
> > > >
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > >
> > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > well since the freeing skb happens in the softirq context, which I
> > > came up with when I was doing the optimization for af_xdp. That is
> > > also used to defer freeing skb to obtain some improvement in
> > > performance. I'd like to know your opinion on this, thanks in advance!
> >
> > Makes sense. I even had a patch like this in my queue ;)
>
> Great to hear that. Look forward to seeing it soon :)
Oh please go ahead!
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-07 16:00 ` Eric Dumazet
@ 2025-11-07 16:03 ` Jason Xing
2025-11-07 16:08 ` Eric Dumazet
0 siblings, 1 reply; 30+ messages in thread
From: Jason Xing @ 2025-11-07 16:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > skb_defer_max value is very conservative, and can be increased
> > > > > to avoid too many calls to kick_defer_list_purge().
> > > > >
> > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > >
> > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > well since the freeing skb happens in the softirq context, which I
> > > > came up with when I was doing the optimization for af_xdp. That is
> > > > also used to defer freeing skb to obtain some improvement in
> > > > performance. I'd like to know your opinion on this, thanks in advance!
> > >
> > > Makes sense. I even had a patch like this in my queue ;)
> >
> > Great to hear that. Look forward to seeing it soon :)
>
> Oh please go ahead !
Okay, thanks for letting me post this minor change. I just thought you
wanted to do this on your own :P
Will do it soon :)
Thanks,
Jason
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-07 16:03 ` Jason Xing
@ 2025-11-07 16:08 ` Eric Dumazet
2025-11-07 16:20 ` Jason Xing
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 16:08 UTC (permalink / raw)
To: Jason Xing
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 8:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > skb_defer_max value is very conservative, and can be increased
> > > > > > to avoid too many calls to kick_defer_list_purge().
> > > > > >
> > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > >
> > > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > > well since the freeing skb happens in the softirq context, which I
> > > > > came up with when I was doing the optimization for af_xdp. That is
> > > > > also used to defer freeing skb to obtain some improvement in
> > > > > performance. I'd like to know your opinion on this, thanks in advance!
> > > >
> > > > Makes sense. I even had a patch like this in my queue ;)
> > >
> > > Great to hear that. Look forward to seeing it soon :)
> >
> > Oh please go ahead !
>
> Okay, thanks for letting me post this minor change. I just thought you
> wanted to do this on your own :P
>
> Will do it soon :)
Note that I was thinking of freeing only 32 skbs when we fill up the array
completely.

The current code frees half of it; keeping 96 skbs and freeing 32 of them
seems better.

Same for the bulk alloc: we could probably go to 32 (instead of 16).
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-07 16:08 ` Eric Dumazet
@ 2025-11-07 16:20 ` Jason Xing
2025-11-07 16:26 ` Eric Dumazet
0 siblings, 1 reply; 30+ messages in thread
From: Jason Xing @ 2025-11-07 16:20 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Sat, Nov 8, 2025 at 12:08 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 8:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > skb_defer_max value is very conservative, and can be increased
> > > > > > > to avoid too many calls to kick_defer_list_purge().
> > > > > > >
> > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > >
> > > > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > > > well since the freeing skb happens in the softirq context, which I
> > > > > > came up with when I was doing the optimization for af_xdp. That is
> > > > > > also used to defer freeing skb to obtain some improvement in
> > > > > > performance. I'd like to know your opinion on this, thanks in advance!
> > > > >
> > > > > Makes sense. I even had a patch like this in my queue ;)
> > > >
> > > > Great to hear that. Look forward to seeing it soon :)
> > >
> > > Oh please go ahead !
> >
> > Okay, thanks for letting me post this minor change. I just thought you
> > wanted to do this on your own :P
> >
> > Will do it soon :)
>
> Note that I was thinking to free only 32 skbs if we fill up the array
> completely.
>
> Current code frees half of it, this seems better trying to keep 96
> skbs and free 32 of them.
>
> Same for the bulk alloc, we could probably go to 32 (instead of 16)
Thanks for your suggestion!

However, sorry, I didn't totally get it. What is the difference between
freeing only 32 and freeing half of the new value? My thought was to free
half, i.e. 128/2 = 64, which would further reduce the number of calls to the
skb free functions. Could you shed some light on those numbers?
Thanks,
Jason
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-07 16:20 ` Jason Xing
@ 2025-11-07 16:26 ` Eric Dumazet
2025-11-07 16:59 ` Jason Xing
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 16:26 UTC (permalink / raw)
To: Jason Xing
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Nov 7, 2025 at 8:21 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sat, Nov 8, 2025 at 12:08 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 8:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > > >
> > > > > > > > skb_defer_max value is very conservative, and can be increased
> > > > > > > > to avoid too many calls to kick_defer_list_purge().
> > > > > > > >
> > > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > >
> > > > > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > > > > well since the freeing skb happens in the softirq context, which I
> > > > > > > came up with when I was doing the optimization for af_xdp. That is
> > > > > > > also used to defer freeing skb to obtain some improvement in
> > > > > > > performance. I'd like to know your opinion on this, thanks in advance!
> > > > > >
> > > > > > Makes sense. I even had a patch like this in my queue ;)
> > > > >
> > > > > Great to hear that. Look forward to seeing it soon :)
> > > >
> > > > Oh please go ahead !
> > >
> > > Okay, thanks for letting me post this minor change. I just thought you
> > > wanted to do this on your own :P
> > >
> > > Will do it soon :)
> >
> > Note that I was thinking to free only 32 skbs if we fill up the array
> > completely.
> >
> > Current code frees half of it, this seems better trying to keep 96
> > skbs and free 32 of them.
> >
> > Same for the bulk alloc, we could probably go to 32 (instead of 16)
>
> Thanks for your suggestion!
>
> However, sorry, I didn't get it totally. I'm wondering what the
> difference between freeing only 32 and freeing half of the new value
> is? My thought was freeing the half, say, 128/2, which minimizes more
> times of performing skb free functions. Could you shed some light on
> those numbers?
If we free half, a subsequent net_rx_action() calling napi_poll() 2 or 3
times will exhaust the remaining 64.

This is a per-cpu reserve of sk_buff (256 bytes each). I think we can afford
keeping them in the pool, just in case...
I also had a prefetch() in napi_skb_cache_get():

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03..3d40c4b0c580afc183c30e2efb0f953d0d5aabf9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -297,6 +297,8 @@ static struct sk_buff *napi_skb_cache_get(void)
 	}
 
 	skb = nc->skb_cache[--nc->skb_count];
+	if (nc->skb_count)
+		prefetch(nc->skb_cache[nc->skb_count - 1]);
 	local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
 	kasan_mempool_unpoison_object(skb, skbuff_cache_size)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
2025-11-07 16:26 ` Eric Dumazet
@ 2025-11-07 16:59 ` Jason Xing
0 siblings, 0 replies; 30+ messages in thread
From: Jason Xing @ 2025-11-07 16:59 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Sat, Nov 8, 2025 at 12:26 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 8:21 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Sat, Nov 8, 2025 at 12:08 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 8:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > > > >
> > > > > > > > > skb_defer_max value is very conservative, and can be increased
> > > > > > > > > to avoid too many calls to kick_defer_list_purge().
> > > > > > > > >
> > > > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > > >
> > > > > > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > > > > > well since the freeing skb happens in the softirq context, which I
> > > > > > > > came up with when I was doing the optimization for af_xdp. That is
> > > > > > > > also used to defer freeing skb to obtain some improvement in
> > > > > > > > performance. I'd like to know your opinion on this, thanks in advance!
> > > > > > >
> > > > > > > Makes sense. I even had a patch like this in my queue ;)
> > > > > >
> > > > > > Great to hear that. Look forward to seeing it soon :)
> > > > >
> > > > > Oh please go ahead !
> > > >
> > > > Okay, thanks for letting me post this minor change. I just thought you
> > > > wanted to do this on your own :P
> > > >
> > > > Will do it soon :)
> > >
> > > Note that I was thinking to free only 32 skbs if we fill up the array
> > > completely.
> > >
> > > Current code frees half of it, this seems better trying to keep 96
> > > skbs and free 32 of them.
> > >
> > > Same for the bulk alloc, we could probably go to 32 (instead of 16)
> >
> > Thanks for your suggestion!
> >
> > However, sorry, I didn't get it totally. I'm wondering what the
> > difference between freeing only 32 and freeing half of the new value
> > is? My thought was freeing the half, say, 128/2, which minimizes more
> > times of performing skb free functions. Could you shed some light on
> > those numbers?
>
> If we free half, a subsequent net_rx_action() calling 2 or 3 times a
> napi_poll() will exhaust the remaining 64.
>
> This is a per-cpu reserve of sk_buff (256 bytes each). I think we can
> afford having them in the pool just in case...
I think I understand what you meant:

1) Freeing half of that (even though it is more than 32) is not a big deal,
as it will be consumed in a few rounds anyway; thus we don't need to adjust
the skb-freeing policy.

2) Enlarging the volume of this pool makes more sense; it would then be
increased from 32 to 64.

3) Also increase NAPI_SKB_CACHE_BULK to 32, in accordance with the above
updates.

If I'm missing something, please point out the direction I should take :)
>
> I also had a prefetch() in napi_skb_cache_get():
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03..3d40c4b0c580afc183c30e2efb0f953d0d5aabf9 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -297,6 +297,8 @@ static struct sk_buff *napi_skb_cache_get(void)
>  	}
>  
>  	skb = nc->skb_cache[--nc->skb_count];
> +	if (nc->skb_count)
> +		prefetch(nc->skb_cache[nc->skb_count - 1]);
>  	local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
>  	kasan_mempool_unpoison_object(skb, skbuff_cache_size)
Interesting. Thanks for your suggestion. I think I will include this :)
Thanks,
Jason
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb()
2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
` (2 preceding siblings ...)
2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
@ 2025-11-08 3:10 ` patchwork-bot+netdevbpf
3 siblings, 0 replies; 30+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-11-08 3:10 UTC (permalink / raw)
To: Eric Dumazet
Cc: davem, kuba, pabeni, horms, kuniyu, willemb, netdev, eric.dumazet
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 6 Nov 2025 20:29:32 +0000 you wrote:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> [...]
Here is the summary with links:
- [net-next,1/3] net: allow skb_release_head_state() to be called multiple times
https://git.kernel.org/netdev/net-next/c/1fcf572211da
- [net-next,2/3] net: fix napi_consume_skb() with alien skbs
https://git.kernel.org/netdev/net-next/c/e20dfbad8aab
- [net-next,3/3] net: increase skb_defer_max default to 128
https://git.kernel.org/netdev/net-next/c/b61785852ed0
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
` (2 preceding siblings ...)
2025-11-07 15:44 ` Jason Xing
@ 2025-11-11 16:56 ` Aditya Garg
2025-11-11 17:17 ` Eric Dumazet
2025-11-12 14:03 ` Jon Hunter
4 siblings, 1 reply; 30+ messages in thread
From: Aditya Garg @ 2025-11-11 16:56 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
ssengar, gargaditya
On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for a UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts :
>
> Before:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
>
> 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> 2.11% swapper [kernel.kallsyms] [k] __slab_free
> 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> 0.93% swapper [kernel.kallsyms] [k] read_tsc
> 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> 0.48% swapper [kernel.kallsyms] [k] sock_wfree
>
> After:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
>
> 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> 2.73% swapper [kernel.kallsyms] [k] read_tsc
> 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> 0.53% swapper [kernel.kallsyms] [k] io_idle
> 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> 0.32% swapper [kernel.kallsyms] [k] dql_completed
> 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> 0.28% swapper [kernel.kallsyms] [k] ktime_get
> 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/skbuff.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>
> DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>
> + if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> + skb_release_head_state(skb);
> + return skb_attempt_defer_free(skb);
> + }
> +
> if (!skb_unref(skb))
> return;
>
> --
> 2.51.2.1041.gc1ab5b90ca-goog
>
I ran these tests on the latest net-next with the MANA driver, and I am observing a regression.
lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
Connecting to host 10.0.0.4, port 5201
[ 5] local 10.0.0.5 port 48692 connected to 10.0.0.4 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 4.57 GBytes 39.2 Gbits/sec 586 1.04 MBytes
[ 5] 1.00-2.00 sec 4.74 GBytes 40.7 Gbits/sec 520 1.13 MBytes
[ 5] 2.00-3.00 sec 5.16 GBytes 44.3 Gbits/sec 191 1.20 MBytes
[ 5] 3.00-4.00 sec 5.13 GBytes 44.1 Gbits/sec 520 1.11 MBytes
[ 5] 4.00-5.00 sec 678 MBytes 5.69 Gbits/sec 93 1.37 KBytes
[ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 9.00-10.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 10.00-11.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 11.00-12.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 12.00-13.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 13.00-14.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 14.00-15.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 15.00-16.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 16.00-17.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 17.00-18.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 18.00-19.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 19.00-20.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 20.00-21.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 21.00-22.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 22.00-23.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 23.00-24.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 24.00-25.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 25.00-26.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 26.00-27.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 27.00-28.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 28.00-29.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
[ 5] 29.00-30.00 sec 0.00 Bytes 0.00 bits/sec 0 1.37 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 20.3 GBytes 5.80 Gbits/sec 1910 sender
[ 5] 0.00-30.00 sec 20.3 GBytes 5.80 Gbits/sec receiver
iperf Done.
I tested again after reverting this patch, and the regression was gone.
lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
Connecting to host 10.0.0.4, port 5201
[ 5] local 10.0.0.5 port 58188 connected to 10.0.0.4 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 4.95 GBytes 42.5 Gbits/sec 541 1.10 MBytes
[ 5] 1.00-2.00 sec 4.92 GBytes 42.3 Gbits/sec 599 878 KBytes
[ 5] 2.00-3.00 sec 4.51 GBytes 38.7 Gbits/sec 438 803 KBytes
[ 5] 3.00-4.00 sec 4.69 GBytes 40.3 Gbits/sec 647 1.17 MBytes
[ 5] 4.00-5.00 sec 4.18 GBytes 35.9 Gbits/sec 1183 715 KBytes
[ 5] 5.00-6.00 sec 5.05 GBytes 43.4 Gbits/sec 484 975 KBytes
[ 5] 6.00-7.00 sec 5.32 GBytes 45.7 Gbits/sec 520 836 KBytes
[ 5] 7.00-8.00 sec 5.29 GBytes 45.5 Gbits/sec 436 1.10 MBytes
[ 5] 8.00-9.00 sec 5.27 GBytes 45.2 Gbits/sec 464 1.30 MBytes
[ 5] 9.00-10.00 sec 5.25 GBytes 45.1 Gbits/sec 425 1.13 MBytes
[ 5] 10.00-11.00 sec 5.29 GBytes 45.4 Gbits/sec 268 1.19 MBytes
[ 5] 11.00-12.00 sec 4.98 GBytes 42.8 Gbits/sec 711 793 KBytes
[ 5] 12.00-13.00 sec 3.80 GBytes 32.6 Gbits/sec 1255 801 KBytes
[ 5] 13.00-14.00 sec 3.80 GBytes 32.7 Gbits/sec 1130 642 KBytes
[ 5] 14.00-15.00 sec 4.31 GBytes 37.0 Gbits/sec 1024 1.11 MBytes
[ 5] 15.00-16.00 sec 5.18 GBytes 44.5 Gbits/sec 359 1.25 MBytes
[ 5] 16.00-17.00 sec 5.23 GBytes 44.9 Gbits/sec 265 900 KBytes
[ 5] 17.00-18.00 sec 4.70 GBytes 40.4 Gbits/sec 769 715 KBytes
[ 5] 18.00-19.00 sec 3.77 GBytes 32.4 Gbits/sec 1841 889 KBytes
[ 5] 19.00-20.00 sec 3.77 GBytes 32.4 Gbits/sec 1084 827 KBytes
[ 5] 20.00-21.00 sec 5.01 GBytes 43.0 Gbits/sec 558 994 KBytes
[ 5] 21.00-22.00 sec 5.27 GBytes 45.3 Gbits/sec 450 1.25 MBytes
[ 5] 22.00-23.00 sec 5.25 GBytes 45.1 Gbits/sec 338 1.18 MBytes
[ 5] 23.00-24.00 sec 5.29 GBytes 45.4 Gbits/sec 200 1.14 MBytes
[ 5] 24.00-25.00 sec 5.29 GBytes 45.5 Gbits/sec 518 1.02 MBytes
[ 5] 25.00-26.00 sec 4.28 GBytes 36.7 Gbits/sec 1258 792 KBytes
[ 5] 26.00-27.00 sec 3.87 GBytes 33.2 Gbits/sec 1365 799 KBytes
[ 5] 27.00-28.00 sec 4.77 GBytes 41.0 Gbits/sec 530 1.09 MBytes
[ 5] 28.00-29.00 sec 5.31 GBytes 45.6 Gbits/sec 419 1.06 MBytes
[ 5] 29.00-30.00 sec 5.32 GBytes 45.7 Gbits/sec 222 1.10 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 144 GBytes 41.2 Gbits/sec 20301 sender
[ 5] 0.00-30.00 sec 144 GBytes 41.2 Gbits/sec receiver
iperf Done.
I am still working through the technicalities of this patch, but wanted to share these initial findings for your input. Please let me know your thoughts on this!
Regards,
Aditya
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-11 16:56 ` Aditya Garg
@ 2025-11-11 17:17 ` Eric Dumazet
2025-11-12 5:14 ` Aditya Garg
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-11 17:17 UTC (permalink / raw)
To: Aditya Garg
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
ssengar, gargaditya
On Tue, Nov 11, 2025 at 8:56 AM Aditya Garg
<gargaditya@linux.microsoft.com> wrote:
>
> On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
> > There is a lack of NUMA awareness and more generally lack
> > of slab caches affinity on TX completion path.
> >
> > Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> > in per-cpu caches so that they can be recycled in RX path.
> >
> > Only use this if the skb was allocated on the same cpu,
> > otherwise use skb_attempt_defer_free() so that the skb
> > is freed on the original cpu.
> >
> > This removes contention on SLUB spinlocks and data structures.
> >
> > After this patch, I get ~50% improvement for an UDP tx workload
> > on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
> >
> > 80 Mpps -> 120 Mpps.
> >
> > Profiling one of the 32 cpus servicing NIC interrupts :
> >
> > Before:
> >
> > mpstat -P 511 1 1
> >
> > Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
> >
> > 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> > 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> > 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> > 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> > 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> > 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> > 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> > 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> > 2.11% swapper [kernel.kallsyms] [k] __slab_free
> > 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> > 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> > 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> > 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> > 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> > 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> > 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> > 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> > 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> > 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> > 0.93% swapper [kernel.kallsyms] [k] read_tsc
> > 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> > 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> > 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> > 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> > 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> > 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> > 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> > 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> > 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> > 0.48% swapper [kernel.kallsyms] [k] sock_wfree
> >
> > After:
> >
> > mpstat -P 511 1 1
> >
> > Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
> >
> > 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> > 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> > 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> > 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> > 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> > 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> > 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> > 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> > 2.73% swapper [kernel.kallsyms] [k] read_tsc
> > 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> > 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> > 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> > 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> > 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> > 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> > 0.53% swapper [kernel.kallsyms] [k] io_idle
> > 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> > 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> > 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> > 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> > 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> > 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> > 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> > 0.32% swapper [kernel.kallsyms] [k] dql_completed
> > 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> > 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> > 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> > 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> > 0.28% swapper [kernel.kallsyms] [k] ktime_get
> > 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > net/core/skbuff.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
> >
> > DEBUG_NET_WARN_ON_ONCE(!in_softirq());
> >
> > + if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> > + skb_release_head_state(skb);
> > + return skb_attempt_defer_free(skb);
> > + }
> > +
> > if (!skb_unref(skb))
> > return;
> >
> > --
> > 2.51.2.1041.gc1ab5b90ca-goog
> >
>
> I ran these tests on latest net-next for MANA driver and I am observing a regression here.
>
> lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
> [iperf3 output snipped; throughput collapsed after ~4 seconds, averaging 5.80 Gbits/sec over the 30 second run]
>
>
> I tested again by reverting this patch and regression was not there.
>
> lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
> [iperf3 output snipped; averaged 41.2 Gbits/sec over the 30 second run]
>
>
> I am still figuring out technicalities of this patch, but wanted to share initial findings for your input. Please let me know your thoughts on this!
Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
Thanks !
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-11 17:17 ` Eric Dumazet
@ 2025-11-12 5:14 ` Aditya Garg
0 siblings, 0 replies; 30+ messages in thread
From: Aditya Garg @ 2025-11-12 5:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
ssengar, gargaditya
On 11-11-2025 22:47, Eric Dumazet wrote:
> On Tue, Nov 11, 2025 at 8:56 AM Aditya Garg
> <gargaditya@linux.microsoft.com> wrote:
>>
>> On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
>>> [quoted patch snipped]
>>
>> I ran these tests on latest net-next for MANA driver and I am observing a regression here.
>>
>> lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
>> [iperf3 output snipped]
>>
>>
>> I tested again by reverting this patch and regression was not there.
>>
>> lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
>> [iperf3 output snipped]
>>
>>
>> I am still figuring out technicalities of this patch, but wanted to share initial findings for your input. Please let me know your thoughts on this!
>
> Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
>
> Thanks !
Thanks Eric, it works fine after above fix.
Regards,
Aditya
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
` (3 preceding siblings ...)
2025-11-11 16:56 ` Aditya Garg
@ 2025-11-12 14:03 ` Jon Hunter
2025-11-12 14:08 ` Eric Dumazet
4 siblings, 1 reply; 30+ messages in thread
From: Jon Hunter @ 2025-11-12 14:03 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, linux-tegra@vger.kernel.org
Hi Eric,
On 06/11/2025 20:29, Eric Dumazet wrote:
> [commit message and perf profiles snipped]
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/skbuff.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>
> DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>
> + if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> + skb_release_head_state(skb);
> + return skb_attempt_defer_free(skb);
> + }
> +
> if (!skb_unref(skb))
> return;
>
I have noticed a suspend regression on one of our Tegra boards. Bisection
points to this commit, and reverting it on top of -next fixes the issue.
Out of all the Tegra boards we test, only one is failing: the
tegra124-jetson-tk1. This board uses the Realtek r8169 driver ...
r8169 0000:01:00.0: enabling device (0140 -> 0143)
r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
I don't see any particular crash or error, and even after resuming from
suspend the link does come up ...
r8169 0000:01:00.0 enp1s0: Link is Down
tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
OOM killer enabled.
Restarting tasks: Starting
Restarting tasks: Done
random: crng reseeded on system resumption
PM: suspend exit
ata1: SATA link down (SStatus 0 SControl 300)
r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
However, the board does not seem to resume fully. One thing I should
point out is that for testing we always use an NFS rootfs. So this
would indicate that the link comes up but networking is still having
issues.
Any thoughts?
Jon
--
nvpublic
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-12 14:03 ` Jon Hunter
@ 2025-11-12 14:08 ` Eric Dumazet
2025-11-12 15:26 ` Jon Hunter
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-12 14:08 UTC (permalink / raw)
To: Jon Hunter
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
linux-tegra@vger.kernel.org
On Wed, Nov 12, 2025 at 6:03 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Eric,
>
> On 06/11/2025 20:29, Eric Dumazet wrote:
> > There is a lack of NUMA awareness and more generally lack
> > of slab caches affinity on TX completion path.
> >
> > Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> > in per-cpu caches so that they can be recycled in RX path.
> >
> > Only use this if the skb was allocated on the same cpu,
> > otherwise use skb_attempt_defer_free() so that the skb
> > is freed on the original cpu.
> >
> > This removes contention on SLUB spinlocks and data structures.
> >
> > After this patch, I get ~50% improvement for an UDP tx workload
> > on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
> >
> > 80 Mpps -> 120 Mpps.
> >
> > Profiling one of the 32 cpus servicing NIC interrupts :
> >
> > Before:
> >
> > mpstat -P 511 1 1
> >
> > Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
> >
> > 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> > 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> > 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> > 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> > 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> > 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> > 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> > 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> > 2.11% swapper [kernel.kallsyms] [k] __slab_free
> > 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> > 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> > 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> > 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> > 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> > 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> > 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> > 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> > 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> > 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> > 0.93% swapper [kernel.kallsyms] [k] read_tsc
> > 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> > 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> > 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> > 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> > 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> > 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> > 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> > 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> > 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> > 0.48% swapper [kernel.kallsyms] [k] sock_wfree
> >
> > After:
> >
> > mpstat -P 511 1 1
> >
> > Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
> >
> > 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> > 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> > 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> > 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> > 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> > 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> > 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> > 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> > 2.73% swapper [kernel.kallsyms] [k] read_tsc
> > 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> > 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> > 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> > 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> > 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> > 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> > 0.53% swapper [kernel.kallsyms] [k] io_idle
> > 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> > 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> > 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> > 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> > 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> > 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> > 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> > 0.32% swapper [kernel.kallsyms] [k] dql_completed
> > 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> > 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> > 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> > 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> > 0.28% swapper [kernel.kallsyms] [k] ktime_get
> > 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > net/core/skbuff.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
> >
> > DEBUG_NET_WARN_ON_ONCE(!in_softirq());
> >
> > + if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> > + skb_release_head_state(skb);
> > + return skb_attempt_defer_free(skb);
> > + }
> > +
> > if (!skb_unref(skb))
> > return;
> >
>
> I have noticed a suspend regression on one of our Tegra boards. Bisect
> is pointing to this commit and reverting this on top of -next fixes the
> issue.
>
> Out of all the Tegra boards we test only one is failing and that is the
> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
>
> r8169 0000:01:00.0: enabling device (0140 -> 0143)
> r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
> r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
>
> I don't see any particular crash or error, and even after resuming from
> suspend the link does come up ...
>
> r8169 0000:01:00.0 enp1s0: Link is Down
> tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
> OOM killer enabled.
> Restarting tasks: Starting
> Restarting tasks: Done
> random: crng reseeded on system resumption
> PM: suspend exit
> ata1: SATA link down (SStatus 0 SControl 300)
> r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
>
> However, the board does not seem to resume fully. One thing I should
> point out is that for testing we always use an NFS rootfs. So this
> would indicate that the link comes up but networking is still having
> issues.
>
> Any thoughts?
>
> Jon
Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
Thanks !
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-12 14:08 ` Eric Dumazet
@ 2025-11-12 15:26 ` Jon Hunter
2025-11-12 15:32 ` Eric Dumazet
0 siblings, 1 reply; 30+ messages in thread
From: Jon Hunter @ 2025-11-12 15:26 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
linux-tegra@vger.kernel.org
On 12/11/2025 14:08, Eric Dumazet wrote:
...
>> I have noticed a suspend regression on one of our Tegra boards. Bisect
>> is pointing to this commit and reverting this on top of -next fixes the
>> issue.
>>
>> Out of all the Tegra boards we test only one is failing and that is the
>> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
>>
>> r8169 0000:01:00.0: enabling device (0140 -> 0143)
>> r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
>> r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
>>
>> I don't see any particular crash or error, and even after resuming from
>> suspend the link does come up ...
>>
>> r8169 0000:01:00.0 enp1s0: Link is Down
>> tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
>> OOM killer enabled.
>> Restarting tasks: Starting
>> Restarting tasks: Done
>> random: crng reseeded on system resumption
>> PM: suspend exit
>> ata1: SATA link down (SStatus 0 SControl 300)
>> r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
>>
>> However, the board does not seem to resume fully. One thing I should
>> point out is that for testing we always use an NFS rootfs. So this
>> would indicate that the link comes up but networking is still having
>> issues.
>>
>> Any thoughts?
>>
>> Jon
>
> Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
That does indeed fix it. Feel free to add my ...
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Thanks
Jon
--
nvpublic
* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
2025-11-12 15:26 ` Jon Hunter
@ 2025-11-12 15:32 ` Eric Dumazet
0 siblings, 0 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-12 15:32 UTC (permalink / raw)
To: Jon Hunter
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
linux-tegra@vger.kernel.org
On Wed, Nov 12, 2025 at 7:27 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 12/11/2025 14:08, Eric Dumazet wrote:
>
> ...
>
> >> I have noticed a suspend regression on one of our Tegra boards. Bisect
> >> is pointing to this commit and reverting this on top of -next fixes the
> >> issue.
> >>
> >> Out of all the Tegra boards we test only one is failing and that is the
> >> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
> >>
> >> r8169 0000:01:00.0: enabling device (0140 -> 0143)
> >> r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
> >> r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
> >>
> >> I don't see any particular crash or error, and even after resuming from
> >> suspend the link does come up ...
> >>
> >> r8169 0000:01:00.0 enp1s0: Link is Down
> >> tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
> >> OOM killer enabled.
> >> Restarting tasks: Starting
> >> Restarting tasks: Done
> >> random: crng reseeded on system resumption
> >> PM: suspend exit
> >> ata1: SATA link down (SStatus 0 SControl 300)
> >> r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
> >>
> >> However, the board does not seem to resume fully. One thing I should
> >> point out is that for testing we always use an NFS rootfs. So this
> >> would indicate that the link comes up but networking is still having
> >> issues.
> >>
> >> Any thoughts?
> >>
> >> Jon
> >
> > Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
>
>
> That does indeed fix it. Feel free to add my ...
>
> Tested-by: Jon Hunter <jonathanh@nvidia.com>
Thanks for testing. Note the patch was merged already in net-next, so
we can not add your tag.
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Nov 11 15:12:35 2025 +0000
net: clear skb->sk in skb_release_head_state()
skb_release_head_state() inlines skb_orphan().
We need to clear skb->sk otherwise we can freeze TCP flows
on a mostly idle host, because skb_fclone_busy() would
return true as long as the packet is not yet processed by
skb_defer_free_flush().
Fixes: 1fcf572211da ("net: allow skb_release_head_state() to be called multiple times")
Fixes: e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251111151235.1903659-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
end of thread, other threads:[~2025-11-12 15:33 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
2025-11-07 6:42 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
2025-11-07 6:46 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
2025-11-07 12:28 ` Eric Dumazet
2025-11-07 14:34 ` Toke Høiland-Jørgensen
2025-11-07 15:44 ` Jason Xing
2025-11-11 16:56 ` Aditya Garg
2025-11-11 17:17 ` Eric Dumazet
2025-11-12 5:14 ` Aditya Garg
2025-11-12 14:03 ` Jon Hunter
2025-11-12 14:08 ` Eric Dumazet
2025-11-12 15:26 ` Jon Hunter
2025-11-12 15:32 ` Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
2025-11-07 6:47 ` Kuniyuki Iwashima
2025-11-07 11:23 ` Toke Høiland-Jørgensen
2025-11-07 15:37 ` Jason Xing
2025-11-07 15:47 ` Eric Dumazet
2025-11-07 15:49 ` Jason Xing
2025-11-07 16:00 ` Eric Dumazet
2025-11-07 16:03 ` Jason Xing
2025-11-07 16:08 ` Eric Dumazet
2025-11-07 16:20 ` Jason Xing
2025-11-07 16:26 ` Eric Dumazet
2025-11-07 16:59 ` Jason Xing
2025-11-08 3:10 ` [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() patchwork-bot+netdevbpf