netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb()
@ 2025-11-06 20:29 Eric Dumazet
  2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-06 20:29 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

There is a lack of NUMA awareness, and more generally a lack
of slab cache affinity, on the TX completion path.

Modern drivers use napi_consume_skb(), hoping to cache sk_buff
objects in per-cpu caches so that they can be recycled in the RX path.

Only use these per-cpu caches if the skb was allocated on the
current cpu; otherwise use skb_attempt_defer_free() so that the
skb is freed on the cpu that allocated it.

This removes contention on SLUB spinlocks and data structures,
and makes sure that recycled sk_buff have correct NUMA locality.
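The dispatch described above can be sketched as a small userspace model
(simplified and hypothetical: struct skb_model and should_defer_free()
are stand-ins for the kernel's skb->alloc_cpu / smp_processor_id() /
skb_shared() checks, not the actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an skb for illustration only. */
struct skb_model {
	unsigned int alloc_cpu;	/* cpu that allocated the skb */
	int users;		/* refcount; >1 means shared */
};

/* Return true when the skb is "alien" (allocated on another cpu) and
 * exclusively owned, so it can be handed to skb_attempt_defer_free()
 * and freed on its original cpu instead of the per-cpu NAPI cache. */
static bool should_defer_free(const struct skb_model *skb, unsigned int cur_cpu)
{
	return skb->alloc_cpu != cur_cpu && skb->users == 1;
}
```

A shared skb must still go through the normal refcounted path, which is
why the real check also requires !skb_shared().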

After this series, I get a ~50% improvement for a UDP tx workload
on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).

I will later refactor skb_attempt_defer_free()
so that it no longer has to care about skb_shared() and skb_release_head_state().

Eric Dumazet (3):
  net: allow skb_release_head_state() to be called multiple times
  net: fix napi_consume_skb() with alien skbs
  net: increase skb_defer_max default to 128

 Documentation/admin-guide/sysctl/net.rst |  4 ++--
 net/core/hotdata.c                       |  2 +-
 net/core/skbuff.c                        | 12 ++++++++----
 3 files changed, 11 insertions(+), 7 deletions(-)

-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times
  2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
@ 2025-11-06 20:29 ` Eric Dumazet
  2025-11-07  6:42   ` Kuniyuki Iwashima
  2025-11-07 11:23   ` Toke Høiland-Jørgensen
  2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-06 20:29 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Currently, only the skb dst is cleared (thanks to skb_dst_drop()).

Make sure skb->destructor, conntrack state and extensions are also cleared.
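Why this makes the function safe to call multiple times can be shown with
a small userspace sketch (hypothetical names: release_head_state_model()
and struct head_state_model stand in for the kernel code): every field
that is released is also cleared, so a second call sees NULL and does
nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the head state released by this patch. */
struct head_state_model {
	void (*destructor)(void);
	void *nfct;	/* conntrack reference */
	void *ext;	/* skb extensions */
};

static int destructor_calls;
static void count_destructor(void) { destructor_calls++; }

static void release_head_state_model(struct head_state_model *s)
{
	if (s->destructor) {
		s->destructor();
		s->destructor = NULL;	/* cleared so a second call is a no-op */
	}
	s->nfct = NULL;	/* models nf_reset_ct(): put reference, then clear */
	s->ext = NULL;	/* models skb_ext_reset(): put extensions, then clear */
}
```

Calling release_head_state_model() twice runs the destructor exactly once,
which is the property napi_consume_skb() will rely on in the next patch.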

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/skbuff.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b4bc8b1c7d5674c19b64f8b15685d74632048fe..eeddb9e737ff28e47c77739db7b25ea68e5aa735 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1149,11 +1149,10 @@ void skb_release_head_state(struct sk_buff *skb)
 				skb);
 
 #endif
+		skb->destructor = NULL;
 	}
-#if IS_ENABLED(CONFIG_NF_CONNTRACK)
-	nf_conntrack_put(skb_nfct(skb));
-#endif
-	skb_ext_put(skb);
+	nf_reset_ct(skb);
+	skb_ext_reset(skb);
 }
 
 /* Free everything but the sk_buff shell. */
-- 
2.51.2.1041.gc1ab5b90ca-goog



* [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
  2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
@ 2025-11-06 20:29 ` Eric Dumazet
  2025-11-07  6:46   ` Kuniyuki Iwashima
                     ` (4 more replies)
  2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
  2025-11-08  3:10 ` [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() patchwork-bot+netdevbpf
  3 siblings, 5 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-06 20:29 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

There is a lack of NUMA awareness, and more generally a lack
of slab cache affinity, on the TX completion path.

Modern drivers use napi_consume_skb(), hoping to cache sk_buff
objects in per-cpu caches so that they can be recycled in the RX path.

Only use these per-cpu caches if the skb was allocated on the
current cpu; otherwise use skb_attempt_defer_free() so that the
skb is freed on the cpu that allocated it.

This removes contention on SLUB spinlocks and data structures.

After this patch, I get a ~50% improvement for a UDP tx workload
on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).

80 Mpps -> 120 Mpps.

Profiling one of the 32 cpus servicing NIC interrupts:

Before:

mpstat -P 511 1 1

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00

    31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
    12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
     3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
     3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
     2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
     2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
     2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
     2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
     2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
     2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
     2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
     2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
     1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
     1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
     1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
     1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
     1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
     1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
     1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
     0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
     0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
     0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
     0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
     0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
     0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
     0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
     0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
     0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
     0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
     0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
     0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree

After:

mpstat -P 511 1 1

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51

    19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
    13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
    10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
    10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
     7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
     5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
     3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
     3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
     2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
     2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
     1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
     1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
     0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
     0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
     0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
     0.53%  swapper          [kernel.kallsyms]  [k] io_idle
     0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
     0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
     0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
     0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
     0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
     0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
     0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
     0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
     0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
     0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
     0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
     0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
     0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
     0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/skbuff.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
 
 	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
 
+	if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
+		skb_release_head_state(skb);
+		return skb_attempt_defer_free(skb);
+	}
+
 	if (!skb_unref(skb))
 		return;
 
-- 
2.51.2.1041.gc1ab5b90ca-goog



* [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
  2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
  2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
@ 2025-11-06 20:29 ` Eric Dumazet
  2025-11-07  6:47   ` Kuniyuki Iwashima
                     ` (2 more replies)
  2025-11-08  3:10 ` [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() patchwork-bot+netdevbpf
  3 siblings, 3 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-06 20:29 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

The skb_defer_max default is very conservative and can be increased
to avoid too many calls to kick_defer_list_purge().
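For reference, the new default can still be overridden through the usual
sysctl interface; a minimal drop-in fragment (the file path is just an
example) would be:

```
# /etc/sysctl.d/90-net-defer.conf -- example path
# Per-cpu cap on skbs queued for deferred freeing on their allocating cpu.
net.core.skb_defer_max = 128
```

At runtime the same knob can be set with `sysctl -w net.core.skb_defer_max=128`
and read back with `sysctl net.core.skb_defer_max`.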

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/admin-guide/sysctl/net.rst | 4 ++--
 net/core/hotdata.c                       | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 991773dcb9cfe57f64bffabc018549b712aed9b0..369a738a68193e897d880eeb2c5a22cd90833938 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -355,9 +355,9 @@ skb_defer_max
 -------------
 
 Max size (in skbs) of the per-cpu list of skbs being freed
-by the cpu which allocated them. Used by TCP stack so far.
+by the cpu which allocated them.
 
-Default: 64
+Default: 128
 
 optmem_max
 ----------
diff --git a/net/core/hotdata.c b/net/core/hotdata.c
index 95d0a4df10069e4529fb9e5b58e8391574085cf1..dddd5c287cf08ba75aec1cc546fd1bc48c0f7b26 100644
--- a/net/core/hotdata.c
+++ b/net/core/hotdata.c
@@ -20,7 +20,7 @@ struct net_hotdata net_hotdata __cacheline_aligned = {
 	.dev_tx_weight = 64,
 	.dev_rx_weight = 64,
 	.sysctl_max_skb_frags = MAX_SKB_FRAGS,
-	.sysctl_skb_defer_max = 64,
+	.sysctl_skb_defer_max = 128,
 	.sysctl_mem_pcpu_rsv = SK_MEMORY_PCPU_RESERVE
 };
 EXPORT_SYMBOL(net_hotdata);
-- 
2.51.2.1041.gc1ab5b90ca-goog



* Re: [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times
  2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
@ 2025-11-07  6:42   ` Kuniyuki Iwashima
  2025-11-07 11:23   ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2025-11-07  6:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Willem de Bruijn, netdev, eric.dumazet

On Thu, Nov 6, 2025 at 12:29 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Currently, only skb dst is cleared (thanks to skb_dst_drop())
>
> Make sure skb->destructor, conntrack and extensions are cleared.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
@ 2025-11-07  6:46   ` Kuniyuki Iwashima
  2025-11-07 11:23   ` Toke Høiland-Jørgensen
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2025-11-07  6:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Willem de Bruijn, netdev, eric.dumazet

On Thu, Nov 6, 2025 at 12:29 PM Eric Dumazet <edumazet@google.com> wrote:
>
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> [ mpstat / perf profiles snipped ]
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Big improvement again!

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
@ 2025-11-07  6:47   ` Kuniyuki Iwashima
  2025-11-07 11:23   ` Toke Høiland-Jørgensen
  2025-11-07 15:37   ` Jason Xing
  2 siblings, 0 replies; 30+ messages in thread
From: Kuniyuki Iwashima @ 2025-11-07  6:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Willem de Bruijn, netdev, eric.dumazet

On Thu, Nov 6, 2025 at 12:29 PM Eric Dumazet <edumazet@google.com> wrote:
>
> skb_defer_max value is very conservative, and can be increased
> to avoid too many calls to kick_defer_list_purge().
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* Re: [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times
  2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
  2025-11-07  6:42   ` Kuniyuki Iwashima
@ 2025-11-07 11:23   ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 11:23 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Eric Dumazet <edumazet@google.com> writes:

> Currently, only skb dst is cleared (thanks to skb_dst_drop())
>
> Make sure skb->destructor, conntrack and extensions are cleared.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>



* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
  2025-11-07  6:46   ` Kuniyuki Iwashima
@ 2025-11-07 11:23   ` Toke Høiland-Jørgensen
  2025-11-07 12:28     ` Eric Dumazet
  2025-11-07 15:44   ` Jason Xing
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 11:23 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Eric Dumazet <edumazet@google.com> writes:

> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> [ mpstat / perf profiles snipped ]
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Impressive!

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>



* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
  2025-11-07  6:47   ` Kuniyuki Iwashima
@ 2025-11-07 11:23   ` Toke Høiland-Jørgensen
  2025-11-07 15:37   ` Jason Xing
  2 siblings, 0 replies; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 11:23 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Eric Dumazet <edumazet@google.com> writes:

> skb_defer_max value is very conservative, and can be increased
> to avoid too many calls to kick_defer_list_purge().
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>



* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-07 11:23   ` Toke Høiland-Jørgensen
@ 2025-11-07 12:28     ` Eric Dumazet
  2025-11-07 14:34       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 12:28 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 3:23 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Impressive!
>
> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

Thanks!

Note that my upcoming plan is also to plumb skb_attempt_defer_free()
into __kfree_skb().

[ Part of a refactor, putting skb_unref() in skb_attempt_defer_free() ]

TCP under pressure would benefit from this a _lot_.


* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-07 12:28     ` Eric Dumazet
@ 2025-11-07 14:34       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 14:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

Eric Dumazet <edumazet@google.com> writes:

> On Fri, Nov 7, 2025 at 3:23 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
>> Impressive!
>>
>> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
>
> Thanks !
>
> Note that my upcoming plan is also to plumb skb_attempt_defer_free()
> into __kfree_skb().
>
> [ Part of a refactor, puting skb_unref() in skb_attempt_defer_free() ]
>
> TCP under pressure would benefit from this a _lot_.

Interesting; look forward to seeing the results :)

-Toke



* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
  2025-11-07  6:47   ` Kuniyuki Iwashima
  2025-11-07 11:23   ` Toke Høiland-Jørgensen
@ 2025-11-07 15:37   ` Jason Xing
  2025-11-07 15:47     ` Eric Dumazet
  2 siblings, 1 reply; 30+ messages in thread
From: Jason Xing @ 2025-11-07 15:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
>
> skb_defer_max value is very conservative, and can be increased
> to avoid too many calls to kick_defer_list_purge().
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I was wondering whether we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128
as well, since the skb freeing happens in softirq context. This is an
idea I came up with while working on the af_xdp optimization, which also
defers skb freeing to obtain some performance improvement. I'd like to
know your opinion on this, thanks in advance!

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks!


* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
  2025-11-07  6:46   ` Kuniyuki Iwashima
  2025-11-07 11:23   ` Toke Høiland-Jørgensen
@ 2025-11-07 15:44   ` Jason Xing
  2025-11-11 16:56   ` Aditya Garg
  2025-11-12 14:03   ` Jon Hunter
  4 siblings, 0 replies; 30+ messages in thread
From: Jason Xing @ 2025-11-07 15:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
>
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts :
>
> Before:
>
> mpstat -P 511 1 1
>
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
>
>     31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>     12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
>      3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>      3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>      2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
>      2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
>      2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
>      2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
>      2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
>      2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
>      2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
>      2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>      1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
>      1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
>      1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>      1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>      1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
>      1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>      1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>      0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
>      0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
>      0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
>      0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>      0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
>      0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>      0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
>      0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
>      0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
>      0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
>      0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
>      0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
>
> After:
>
> mpstat -P 511 1 1
>
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
>
>     19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>     13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>     10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
>     10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>      7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
>      5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
>      3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>      3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
>      2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
>      2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>      1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>      1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
>      0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
>      0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
>      0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
>      0.53%  swapper          [kernel.kallsyms]  [k] io_idle
>      0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
>      0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
>      0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
>      0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
>      0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
>      0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
>      0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
>      0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
>      0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
>      0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
>      0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
>      0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
>      0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
>      0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thanks for your brilliant work, which once again gives me so much
inspiration.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks,
Jason


* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-07 15:37   ` Jason Xing
@ 2025-11-07 15:47     ` Eric Dumazet
  2025-11-07 15:49       ` Jason Xing
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 15:47 UTC (permalink / raw)
  To: Jason Xing
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > skb_defer_max value is very conservative, and can be increased
> > to avoid too many calls to kick_defer_list_purge().
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> well since the freeing skb happens in the softirq context, which I
> came up with when I was doing the optimization for af_xdp. That is
> also used to defer freeing skb to obtain some improvement in
> performance. I'd like to know your opinion on this, thanks in advance!

Makes sense. I even had a patch like this in my queue ;)

>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>
> Thanks!


* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-07 15:47     ` Eric Dumazet
@ 2025-11-07 15:49       ` Jason Xing
  2025-11-07 16:00         ` Eric Dumazet
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Xing @ 2025-11-07 15:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > skb_defer_max value is very conservative, and can be increased
> > > to avoid too many calls to kick_defer_list_purge().
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > well since the freeing skb happens in the softirq context, which I
> > came up with when I was doing the optimization for af_xdp. That is
> > also used to defer freeing skb to obtain some improvement in
> > performance. I'd like to know your opinion on this, thanks in advance!
>
> Makes sense. I even had a patch like this in my queue ;)

Great to hear that. Look forward to seeing it soon :)

Thanks,
Jason


* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-07 15:49       ` Jason Xing
@ 2025-11-07 16:00         ` Eric Dumazet
  2025-11-07 16:03           ` Jason Xing
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 16:00 UTC (permalink / raw)
  To: Jason Xing
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > skb_defer_max value is very conservative, and can be increased
> > > > to avoid too many calls to kick_defer_list_purge().
> > > >
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > >
> > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > well since the freeing skb happens in the softirq context, which I
> > > came up with when I was doing the optimization for af_xdp. That is
> > > also used to defer freeing skb to obtain some improvement in
> > > performance. I'd like to know your opinion on this, thanks in advance!
> >
> > Makes sense. I even had a patch like this in my queue ;)
>
> Great to hear that. Look forward to seeing it soon :)

Oh please go ahead !


* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-07 16:00         ` Eric Dumazet
@ 2025-11-07 16:03           ` Jason Xing
  2025-11-07 16:08             ` Eric Dumazet
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Xing @ 2025-11-07 16:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > skb_defer_max value is very conservative, and can be increased
> > > > > to avoid too many calls to kick_defer_list_purge().
> > > > >
> > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > >
> > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > well since the freeing skb happens in the softirq context, which I
> > > > came up with when I was doing the optimization for af_xdp. That is
> > > > also used to defer freeing skb to obtain some improvement in
> > > > performance. I'd like to know your opinion on this, thanks in advance!
> > >
> > > Makes sense. I even had a patch like this in my queue ;)
> >
> > Great to hear that. Look forward to seeing it soon :)
>
> Oh please go ahead !

Okay, thanks for letting me post this minor change. I just thought you
wanted to do this on your own :P

Will do it soon :)

Thanks,
Jason


* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-07 16:03           ` Jason Xing
@ 2025-11-07 16:08             ` Eric Dumazet
  2025-11-07 16:20               ` Jason Xing
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 16:08 UTC (permalink / raw)
  To: Jason Xing
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 8:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > skb_defer_max value is very conservative, and can be increased
> > > > > > to avoid too many calls to kick_defer_list_purge().
> > > > > >
> > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > >
> > > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > > well since the freeing skb happens in the softirq context, which I
> > > > > came up with when I was doing the optimization for af_xdp. That is
> > > > > also used to defer freeing skb to obtain some improvement in
> > > > > performance. I'd like to know your opinion on this, thanks in advance!
> > > >
> > > > Makes sense. I even had a patch like this in my queue ;)
> > >
> > > Great to hear that. Look forward to seeing it soon :)
> >
> > Oh please go ahead !
>
> Okay, thanks for letting me post this minor change. I just thought you
> wanted to do this on your own :P
>
> Will do it soon :)

Note that I was thinking of freeing only 32 skbs if we fill up the array
completely.

The current code frees half of it; it seems better to keep 96 skbs and
free 32 of them.

Same for the bulk alloc: we could probably go to 32 (instead of 16).


* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-07 16:08             ` Eric Dumazet
@ 2025-11-07 16:20               ` Jason Xing
  2025-11-07 16:26                 ` Eric Dumazet
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Xing @ 2025-11-07 16:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Sat, Nov 8, 2025 at 12:08 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 8:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > skb_defer_max value is very conservative, and can be increased
> > > > > > > to avoid too many calls to kick_defer_list_purge().
> > > > > > >
> > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > >
> > > > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > > > well since the freeing skb happens in the softirq context, which I
> > > > > > came up with when I was doing the optimization for af_xdp. That is
> > > > > > also used to defer freeing skb to obtain some improvement in
> > > > > > performance. I'd like to know your opinion on this, thanks in advance!
> > > > >
> > > > > Makes sense. I even had a patch like this in my queue ;)
> > > >
> > > > Great to hear that. Look forward to seeing it soon :)
> > >
> > > Oh please go ahead !
> >
> > Okay, thanks for letting me post this minor change. I just thought you
> > wanted to do this on your own :P
> >
> > Will do it soon :)
>
> Note that I was thinking to free only 32 skbs if we fill up the array
> completely.
>
> Current code frees half of it, this seems better trying to keep 96
> skbs and free 32 of them.
>
> Same for the bulk alloc, we could probably go to 32 (instead of 16)

Thanks for your suggestion!

However, sorry, I didn't fully get it. I'm wondering what the difference
is between freeing only 32 and freeing half of the new value. My thought
was to free half, say 128/2, which minimizes the number of calls to the
skb free functions. Could you shed some light on those numbers?

Thanks,
Jason


* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-07 16:20               ` Jason Xing
@ 2025-11-07 16:26                 ` Eric Dumazet
  2025-11-07 16:59                   ` Jason Xing
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-07 16:26 UTC (permalink / raw)
  To: Jason Xing
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 8:21 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sat, Nov 8, 2025 at 12:08 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 8:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > > >
> > > > > > > > skb_defer_max value is very conservative, and can be increased
> > > > > > > > to avoid too many calls to kick_defer_list_purge().
> > > > > > > >
> > > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > >
> > > > > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > > > > well since the freeing skb happens in the softirq context, which I
> > > > > > > came up with when I was doing the optimization for af_xdp. That is
> > > > > > > also used to defer freeing skb to obtain some improvement in
> > > > > > > performance. I'd like to know your opinion on this, thanks in advance!
> > > > > >
> > > > > > Makes sense. I even had a patch like this in my queue ;)
> > > > >
> > > > > Great to hear that. Look forward to seeing it soon :)
> > > >
> > > > Oh please go ahead !
> > >
> > > Okay, thanks for letting me post this minor change. I just thought you
> > > wanted to do this on your own :P
> > >
> > > Will do it soon :)
> >
> > Note that I was thinking to free only 32 skbs if we fill up the array
> > completely.
> >
> > Current code frees half of it, this seems better trying to keep 96
> > skbs and free 32 of them.
> >
> > Same for the bulk alloc, we could probably go to 32 (instead of 16)
>
> Thanks for your suggestion!
>
> However, sorry, I didn't get it totally. I'm wondering what the
> difference between freeing only 32 and freeing half of the new value
> is? My thought was freeing the half, say, 128/2, which minimizes more
> times of performing skb free functions. Could you shed some light on
> those numbers?

If we free half, a subsequent net_rx_action() calling napi_poll() 2 or 3
times will exhaust the remaining 64.

This is a per-cpu reserve of sk_buff (256 bytes each). I think we can
afford keeping them in the pool just in case...

I also had a prefetch() in napi_skb_cache_get() :

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03..3d40c4b0c580afc183c30e2efb0f953d0d5aabf9
100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -297,6 +297,8 @@ static struct sk_buff *napi_skb_cache_get(void)
        }

        skb = nc->skb_cache[--nc->skb_count];
+       if (nc->skb_count)
+               prefetch(nc->skb_cache[nc->skb_count - 1]);
        local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
        kasan_mempool_unpoison_object(skb, skbuff_cache_size);


* Re: [PATCH net-next 3/3] net: increase skb_defer_max default to 128
  2025-11-07 16:26                 ` Eric Dumazet
@ 2025-11-07 16:59                   ` Jason Xing
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Xing @ 2025-11-07 16:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Sat, Nov 8, 2025 at 12:26 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 8:21 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Sat, Nov 8, 2025 at 12:08 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Nov 7, 2025 at 8:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Sat, Nov 8, 2025 at 12:00 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Fri, Nov 7, 2025 at 7:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 7, 2025 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Nov 7, 2025 at 7:37 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Nov 7, 2025 at 4:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > > > >
> > > > > > > > > skb_defer_max value is very conservative, and can be increased
> > > > > > > > > to avoid too many calls to kick_defer_list_purge().
> > > > > > > > >
> > > > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > > >
> > > > > > > > I was thinking if we ought to enlarge NAPI_SKB_CACHE_SIZE() to 128 as
> > > > > > > > well since the freeing skb happens in the softirq context, which I
> > > > > > > > came up with when I was doing the optimization for af_xdp. That is
> > > > > > > > also used to defer freeing skb to obtain some improvement in
> > > > > > > > performance. I'd like to know your opinion on this, thanks in advance!
> > > > > > >
> > > > > > > Makes sense. I even had a patch like this in my queue ;)
> > > > > >
> > > > > > Great to hear that. Look forward to seeing it soon :)
> > > > >
> > > > > Oh please go ahead !
> > > >
> > > > Okay, thanks for letting me post this minor change. I just thought you
> > > > wanted to do this on your own :P
> > > >
> > > > Will do it soon :)
> > >
> > > Note that I was thinking to free only 32 skbs if we fill up the array
> > > completely.
> > >
> > > Current code frees half of it, this seems better trying to keep 96
> > > skbs and free 32 of them.
> > >
> > > Same for the bulk alloc, we could probably go to 32 (instead of 16)
> >
> > Thanks for your suggestion!
> >
> > However, sorry, I didn't get it totally. I'm wondering what the
> > difference between freeing only 32 and freeing half of the new value
> > is? My thought was freeing the half, say, 128/2, which minimizes more
> > times of performing skb free functions. Could you shed some light on
> > those numbers?
>
> If we free half, a subsequent net_rx_action() calling 2 or 3 times a
> napi_poll() will exhaust the remaining 64.
>
> This is a per-cpu reserve of sk_buff (256 bytes each). I think we can
> afford having them in the pool just in case...

I think I understand what you meant:
1) Freeing half of that (even though it is more than 32) is not a big
deal, as it will be consumed within a few rounds. Thus, we don't need to
adjust the skb freeing policy.
2) Enlarging this pool makes more sense. It would then be increased from
32 to 64.
3) Also increase NAPI_SKB_CACHE_BULK to 32, in accordance with the above
updates.

If I'm missing something, please point me in the right direction :)

>
> I also had a prefetch() in napi_skb_cache_get() :
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03..3d40c4b0c580afc183c30e2efb0f953d0d5aabf9
> 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -297,6 +297,8 @@ static struct sk_buff *napi_skb_cache_get(void)
>         }
>
>         skb = nc->skb_cache[--nc->skb_count];
> +       if (nc->skb_count)
> +               prefetch(nc->skb_cache[nc->skb_count - 1]);
>         local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
>         kasan_mempool_unpoison_object(skb, skbuff_cache_size);

Interesting. Thanks for your suggestion. I think I will include this :)

Thanks,
Jason


* Re: [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb()
  2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
                   ` (2 preceding siblings ...)
  2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
@ 2025-11-08  3:10 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 30+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-11-08  3:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, horms, kuniyu, willemb, netdev, eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu,  6 Nov 2025 20:29:32 +0000 you wrote:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
> 
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
> 
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
> 
> [...]

Here is the summary with links:
  - [net-next,1/3] net: allow skb_release_head_state() to be called multiple times
    https://git.kernel.org/netdev/net-next/c/1fcf572211da
  - [net-next,2/3] net: fix napi_consume_skb() with alien skbs
    https://git.kernel.org/netdev/net-next/c/e20dfbad8aab
  - [net-next,3/3] net: increase skb_defer_max default to 128
    https://git.kernel.org/netdev/net-next/c/b61785852ed0

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
                     ` (2 preceding siblings ...)
  2025-11-07 15:44   ` Jason Xing
@ 2025-11-11 16:56   ` Aditya Garg
  2025-11-11 17:17     ` Eric Dumazet
  2025-11-12 14:03   ` Jon Hunter
  4 siblings, 1 reply; 30+ messages in thread
From: Aditya Garg @ 2025-11-11 16:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	ssengar, gargaditya

On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
> 
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
> 
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
> 
> This removes contention on SLUB spinlocks and data structures.
> 
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
> 
> 80 Mpps -> 120 Mpps.
> 
> Profiling one of the 32 cpus servicing NIC interrupts :
> 
> Before:
> 
> mpstat -P 511 1 1
> 
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
> 
>     31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>     12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
>      3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>      3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>      2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
>      2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
>      2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
>      2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
>      2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
>      2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
>      2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
>      2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>      1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
>      1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
>      1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>      1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>      1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
>      1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>      1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>      0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
>      0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
>      0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
>      0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>      0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
>      0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>      0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
>      0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
>      0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
>      0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
>      0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
>      0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
> 
> After:
> 
> mpstat -P 511 1 1
> 
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
> 
>     19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>     13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>     10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
>     10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>      7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
>      5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
>      3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>      3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
>      2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
>      2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>      1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>      1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
>      0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
>      0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
>      0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
>      0.53%  swapper          [kernel.kallsyms]  [k] io_idle
>      0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
>      0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
>      0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
>      0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
>      0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
>      0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
>      0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
>      0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
>      0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
>      0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
>      0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
>      0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
>      0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
>      0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/core/skbuff.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>  
>  	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>  
> +	if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> +		skb_release_head_state(skb);
> +		return skb_attempt_defer_free(skb);
> +	}
> +
>  	if (!skb_unref(skb))
>  		return;
>  
> -- 
> 2.51.2.1041.gc1ab5b90ca-goog
>

I ran these tests on the latest net-next with the MANA driver and I am observing a regression here.

lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
Connecting to host 10.0.0.4, port 5201
[  5] local 10.0.0.5 port 48692 connected to 10.0.0.4 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.57 GBytes  39.2 Gbits/sec  586   1.04 MBytes
[  5]   1.00-2.00   sec  4.74 GBytes  40.7 Gbits/sec  520   1.13 MBytes
[  5]   2.00-3.00   sec  5.16 GBytes  44.3 Gbits/sec  191   1.20 MBytes
[  5]   3.00-4.00   sec  5.13 GBytes  44.1 Gbits/sec  520   1.11 MBytes
[  5]   4.00-5.00   sec   678 MBytes  5.69 Gbits/sec   93   1.37 KBytes
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  11.00-12.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  12.00-13.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  13.00-14.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  14.00-15.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  15.00-16.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  16.00-17.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  17.00-18.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  18.00-19.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  19.00-20.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  20.00-21.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  21.00-22.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  22.00-23.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  23.00-24.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  24.00-25.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  25.00-26.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  26.00-27.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  27.00-28.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  28.00-29.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
[  5]  29.00-30.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec  1910             sender
[  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec                  receiver

iperf Done.


I tested again after reverting this patch and the regression was gone.

lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
Connecting to host 10.0.0.4, port 5201
[  5] local 10.0.0.5 port 58188 connected to 10.0.0.4 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.95 GBytes  42.5 Gbits/sec  541   1.10 MBytes
[  5]   1.00-2.00   sec  4.92 GBytes  42.3 Gbits/sec  599    878 KBytes
[  5]   2.00-3.00   sec  4.51 GBytes  38.7 Gbits/sec  438    803 KBytes
[  5]   3.00-4.00   sec  4.69 GBytes  40.3 Gbits/sec  647   1.17 MBytes
[  5]   4.00-5.00   sec  4.18 GBytes  35.9 Gbits/sec  1183    715 KBytes
[  5]   5.00-6.00   sec  5.05 GBytes  43.4 Gbits/sec  484    975 KBytes
[  5]   6.00-7.00   sec  5.32 GBytes  45.7 Gbits/sec  520    836 KBytes
[  5]   7.00-8.00   sec  5.29 GBytes  45.5 Gbits/sec  436   1.10 MBytes
[  5]   8.00-9.00   sec  5.27 GBytes  45.2 Gbits/sec  464   1.30 MBytes
[  5]   9.00-10.00  sec  5.25 GBytes  45.1 Gbits/sec  425   1.13 MBytes
[  5]  10.00-11.00  sec  5.29 GBytes  45.4 Gbits/sec  268   1.19 MBytes
[  5]  11.00-12.00  sec  4.98 GBytes  42.8 Gbits/sec  711    793 KBytes
[  5]  12.00-13.00  sec  3.80 GBytes  32.6 Gbits/sec  1255    801 KBytes
[  5]  13.00-14.00  sec  3.80 GBytes  32.7 Gbits/sec  1130    642 KBytes
[  5]  14.00-15.00  sec  4.31 GBytes  37.0 Gbits/sec  1024   1.11 MBytes
[  5]  15.00-16.00  sec  5.18 GBytes  44.5 Gbits/sec  359   1.25 MBytes
[  5]  16.00-17.00  sec  5.23 GBytes  44.9 Gbits/sec  265    900 KBytes
[  5]  17.00-18.00  sec  4.70 GBytes  40.4 Gbits/sec  769    715 KBytes
[  5]  18.00-19.00  sec  3.77 GBytes  32.4 Gbits/sec  1841    889 KBytes
[  5]  19.00-20.00  sec  3.77 GBytes  32.4 Gbits/sec  1084    827 KBytes
[  5]  20.00-21.00  sec  5.01 GBytes  43.0 Gbits/sec  558    994 KBytes
[  5]  21.00-22.00  sec  5.27 GBytes  45.3 Gbits/sec  450   1.25 MBytes
[  5]  22.00-23.00  sec  5.25 GBytes  45.1 Gbits/sec  338   1.18 MBytes
[  5]  23.00-24.00  sec  5.29 GBytes  45.4 Gbits/sec  200   1.14 MBytes
[  5]  24.00-25.00  sec  5.29 GBytes  45.5 Gbits/sec  518   1.02 MBytes
[  5]  25.00-26.00  sec  4.28 GBytes  36.7 Gbits/sec  1258    792 KBytes
[  5]  26.00-27.00  sec  3.87 GBytes  33.2 Gbits/sec  1365    799 KBytes
[  5]  27.00-28.00  sec  4.77 GBytes  41.0 Gbits/sec  530   1.09 MBytes
[  5]  28.00-29.00  sec  5.31 GBytes  45.6 Gbits/sec  419   1.06 MBytes
[  5]  29.00-30.00  sec  5.32 GBytes  45.7 Gbits/sec  222   1.10 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec  20301             sender
[  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec                  receiver

iperf Done.


I am still working through the technicalities of this patch, but wanted to share these initial findings for your input. Please let me know your thoughts!

Regards,
Aditya 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-11 16:56   ` Aditya Garg
@ 2025-11-11 17:17     ` Eric Dumazet
  2025-11-12  5:14       ` Aditya Garg
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-11 17:17 UTC (permalink / raw)
  To: Aditya Garg
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	ssengar, gargaditya

On Tue, Nov 11, 2025 at 8:56 AM Aditya Garg
<gargaditya@linux.microsoft.com> wrote:
>
> On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
> > There is a lack of NUMA awareness and more generally lack
> > of slab caches affinity on TX completion path.
> >
> > Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> > in per-cpu caches so that they can be recycled in RX path.
> >
> > Only use this if the skb was allocated on the same cpu,
> > otherwise use skb_attempt_defer_free() so that the skb
> > is freed on the original cpu.
> >
> > This removes contention on SLUB spinlocks and data structures.
> >
> > After this patch, I get ~50% improvement for an UDP tx workload
> > on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
> >
> > 80 Mpps -> 120 Mpps.
> >
> > Profiling one of the 32 cpus servicing NIC interrupts :
> >
> > Before:
> >
> > mpstat -P 511 1 1
> >
> > Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> > Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
> >
> >     31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> >     12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> >      5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
> >      3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
> >      3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
> >      2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
> >      2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
> >      2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
> >      2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
> >      2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
> >      2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
> >      2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
> >      2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
> >      1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
> >      1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
> >      1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
> >      1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
> >      1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
> >      1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
> >      1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
> >      0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
> >      0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
> >      0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
> >      0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
> >      0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
> >      0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
> >      0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
> >      0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
> >      0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
> >      0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
> >      0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
> >      0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
> >
> > After:
> >
> > mpstat -P 511 1 1
> >
> > Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> > Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
> >
> >     19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
> >     13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
> >     10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
> >     10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
> >      7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> >      6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
> >      5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
> >      3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
> >      3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
> >      2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
> >      2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
> >      1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
> >      1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
> >      0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
> >      0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
> >      0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
> >      0.53%  swapper          [kernel.kallsyms]  [k] io_idle
> >      0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
> >      0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
> >      0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
> >      0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
> >      0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
> >      0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
> >      0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
> >      0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
> >      0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
> >      0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
> >      0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
> >      0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
> >      0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
> >      0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  net/core/skbuff.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
> >
> >       DEBUG_NET_WARN_ON_ONCE(!in_softirq());
> >
> > +     if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> > +             skb_release_head_state(skb);
> > +             return skb_attempt_defer_free(skb);
> > +     }
> > +
> >       if (!skb_unref(skb))
> >               return;
> >
> > --
> > 2.51.2.1041.gc1ab5b90ca-goog
> >
>
> I ran these tests on the latest net-next with the MANA driver and I am observing a regression here.
>
> lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
> Connecting to host 10.0.0.4, port 5201
> [  5] local 10.0.0.5 port 48692 connected to 10.0.0.4 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.57 GBytes  39.2 Gbits/sec  586   1.04 MBytes
> [  5]   1.00-2.00   sec  4.74 GBytes  40.7 Gbits/sec  520   1.13 MBytes
> [  5]   2.00-3.00   sec  5.16 GBytes  44.3 Gbits/sec  191   1.20 MBytes
> [  5]   3.00-4.00   sec  5.13 GBytes  44.1 Gbits/sec  520   1.11 MBytes
> [  5]   4.00-5.00   sec   678 MBytes  5.69 Gbits/sec   93   1.37 KBytes
> [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  11.00-12.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  12.00-13.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  13.00-14.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  14.00-15.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  15.00-16.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  16.00-17.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  17.00-18.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  18.00-19.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  19.00-20.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  20.00-21.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  21.00-22.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  22.00-23.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  23.00-24.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  24.00-25.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  25.00-26.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  26.00-27.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  27.00-28.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  28.00-29.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> [  5]  29.00-30.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec  1910             sender
> [  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec                  receiver
>
> iperf Done.
>
>
> I tested again after reverting this patch and the regression was gone.
>
> lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
> Connecting to host 10.0.0.4, port 5201
> [  5] local 10.0.0.5 port 58188 connected to 10.0.0.4 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.95 GBytes  42.5 Gbits/sec  541   1.10 MBytes
> [  5]   1.00-2.00   sec  4.92 GBytes  42.3 Gbits/sec  599    878 KBytes
> [  5]   2.00-3.00   sec  4.51 GBytes  38.7 Gbits/sec  438    803 KBytes
> [  5]   3.00-4.00   sec  4.69 GBytes  40.3 Gbits/sec  647   1.17 MBytes
> [  5]   4.00-5.00   sec  4.18 GBytes  35.9 Gbits/sec  1183    715 KBytes
> [  5]   5.00-6.00   sec  5.05 GBytes  43.4 Gbits/sec  484    975 KBytes
> [  5]   6.00-7.00   sec  5.32 GBytes  45.7 Gbits/sec  520    836 KBytes
> [  5]   7.00-8.00   sec  5.29 GBytes  45.5 Gbits/sec  436   1.10 MBytes
> [  5]   8.00-9.00   sec  5.27 GBytes  45.2 Gbits/sec  464   1.30 MBytes
> [  5]   9.00-10.00  sec  5.25 GBytes  45.1 Gbits/sec  425   1.13 MBytes
> [  5]  10.00-11.00  sec  5.29 GBytes  45.4 Gbits/sec  268   1.19 MBytes
> [  5]  11.00-12.00  sec  4.98 GBytes  42.8 Gbits/sec  711    793 KBytes
> [  5]  12.00-13.00  sec  3.80 GBytes  32.6 Gbits/sec  1255    801 KBytes
> [  5]  13.00-14.00  sec  3.80 GBytes  32.7 Gbits/sec  1130    642 KBytes
> [  5]  14.00-15.00  sec  4.31 GBytes  37.0 Gbits/sec  1024   1.11 MBytes
> [  5]  15.00-16.00  sec  5.18 GBytes  44.5 Gbits/sec  359   1.25 MBytes
> [  5]  16.00-17.00  sec  5.23 GBytes  44.9 Gbits/sec  265    900 KBytes
> [  5]  17.00-18.00  sec  4.70 GBytes  40.4 Gbits/sec  769    715 KBytes
> [  5]  18.00-19.00  sec  3.77 GBytes  32.4 Gbits/sec  1841    889 KBytes
> [  5]  19.00-20.00  sec  3.77 GBytes  32.4 Gbits/sec  1084    827 KBytes
> [  5]  20.00-21.00  sec  5.01 GBytes  43.0 Gbits/sec  558    994 KBytes
> [  5]  21.00-22.00  sec  5.27 GBytes  45.3 Gbits/sec  450   1.25 MBytes
> [  5]  22.00-23.00  sec  5.25 GBytes  45.1 Gbits/sec  338   1.18 MBytes
> [  5]  23.00-24.00  sec  5.29 GBytes  45.4 Gbits/sec  200   1.14 MBytes
> [  5]  24.00-25.00  sec  5.29 GBytes  45.5 Gbits/sec  518   1.02 MBytes
> [  5]  25.00-26.00  sec  4.28 GBytes  36.7 Gbits/sec  1258    792 KBytes
> [  5]  26.00-27.00  sec  3.87 GBytes  33.2 Gbits/sec  1365    799 KBytes
> [  5]  27.00-28.00  sec  4.77 GBytes  41.0 Gbits/sec  530   1.09 MBytes
> [  5]  28.00-29.00  sec  5.31 GBytes  45.6 Gbits/sec  419   1.06 MBytes
> [  5]  29.00-30.00  sec  5.32 GBytes  45.7 Gbits/sec  222   1.10 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec  20301             sender
> [  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec                  receiver
>
> iperf Done.
>
>
> I am still working through the technicalities of this patch, but wanted to share these initial findings for your input. Please let me know your thoughts!

Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/

Thanks !

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-11 17:17     ` Eric Dumazet
@ 2025-11-12  5:14       ` Aditya Garg
  0 siblings, 0 replies; 30+ messages in thread
From: Aditya Garg @ 2025-11-12  5:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	ssengar, gargaditya

On 11-11-2025 22:47, Eric Dumazet wrote:
> On Tue, Nov 11, 2025 at 8:56 AM Aditya Garg
> <gargaditya@linux.microsoft.com> wrote:
>>
>> On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
>>> There is a lack of NUMA awareness and more generally lack
>>> of slab caches affinity on TX completion path.
>>>
>>> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
>>> in per-cpu caches so that they can be recycled in RX path.
>>>
>>> Only use this if the skb was allocated on the same cpu,
>>> otherwise use skb_attempt_defer_free() so that the skb
>>> is freed on the original cpu.
>>>
>>> This removes contention on SLUB spinlocks and data structures.
>>>
>>> After this patch, I get ~50% improvement for an UDP tx workload
>>> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>>>
>>> 80 Mpps -> 120 Mpps.
>>>
>>> Profiling one of the 32 cpus servicing NIC interrupts :
>>>
>>> Before:
>>>
>>> mpstat -P 511 1 1
>>>
>>> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>> Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
>>>
>>>      31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>>      12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>>       5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
>>>       3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>>>       3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>>>       2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
>>>       2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
>>>       2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
>>>       2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
>>>       2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
>>>       2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
>>>       2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
>>>       2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>>>       1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
>>>       1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
>>>       1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>>>       1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>>>       1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
>>>       1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>>>       1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>>>       0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
>>>       0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
>>>       0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
>>>       0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>>>       0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
>>>       0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>>>       0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
>>>       0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
>>>       0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
>>>       0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
>>>       0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
>>>       0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
>>>
>>> After:
>>>
>>> mpstat -P 511 1 1
>>>
>>> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>> Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
>>>
>>>      19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>>>      13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>>>      10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
>>>      10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>>>       7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>>       6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
>>>       5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
>>>       3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>>>       3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
>>>       2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
>>>       2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>>>       1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>>>       1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
>>>       0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
>>>       0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
>>>       0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
>>>       0.53%  swapper          [kernel.kallsyms]  [k] io_idle
>>>       0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
>>>       0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
>>>       0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
>>>       0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
>>>       0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
>>>       0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
>>>       0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
>>>       0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
>>>       0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
>>>       0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
>>>       0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
>>>       0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
>>>       0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
>>>       0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
>>>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> ---
>>>   net/core/skbuff.c | 5 +++++
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>>>
>>>        DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>>>
>>> +     if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
>>> +             skb_release_head_state(skb);
>>> +             return skb_attempt_defer_free(skb);
>>> +     }
>>> +
>>>        if (!skb_unref(skb))
>>>                return;
>>>
>>> --
>>> 2.51.2.1041.gc1ab5b90ca-goog
>>>
>>
>> I ran these tests on the latest net-next with the MANA driver and I am observing a regression here.
>>
>> lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
>> Connecting to host 10.0.0.4, port 5201
>> [  5] local 10.0.0.5 port 48692 connected to 10.0.0.4 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec  4.57 GBytes  39.2 Gbits/sec  586   1.04 MBytes
>> [  5]   1.00-2.00   sec  4.74 GBytes  40.7 Gbits/sec  520   1.13 MBytes
>> [  5]   2.00-3.00   sec  5.16 GBytes  44.3 Gbits/sec  191   1.20 MBytes
>> [  5]   3.00-4.00   sec  5.13 GBytes  44.1 Gbits/sec  520   1.11 MBytes
>> [  5]   4.00-5.00   sec   678 MBytes  5.69 Gbits/sec   93   1.37 KBytes
>> [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  11.00-12.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  12.00-13.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  13.00-14.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  14.00-15.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  15.00-16.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  16.00-17.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  17.00-18.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  18.00-19.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  19.00-20.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  20.00-21.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  21.00-22.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  22.00-23.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  23.00-24.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  24.00-25.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  25.00-26.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  26.00-27.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  27.00-28.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  28.00-29.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  29.00-30.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Retr
>> [  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec  1910             sender
>> [  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec                  receiver
>>
>> iperf Done.
>>
>>
>> I tested again by reverting this patch and regression was not there.
>>
>> lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
>> Connecting to host 10.0.0.4, port 5201
>> [  5] local 10.0.0.5 port 58188 connected to 10.0.0.4 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec  4.95 GBytes  42.5 Gbits/sec  541   1.10 MBytes
>> [  5]   1.00-2.00   sec  4.92 GBytes  42.3 Gbits/sec  599    878 KBytes
>> [  5]   2.00-3.00   sec  4.51 GBytes  38.7 Gbits/sec  438    803 KBytes
>> [  5]   3.00-4.00   sec  4.69 GBytes  40.3 Gbits/sec  647   1.17 MBytes
>> [  5]   4.00-5.00   sec  4.18 GBytes  35.9 Gbits/sec  1183    715 KBytes
>> [  5]   5.00-6.00   sec  5.05 GBytes  43.4 Gbits/sec  484    975 KBytes
>> [  5]   6.00-7.00   sec  5.32 GBytes  45.7 Gbits/sec  520    836 KBytes
>> [  5]   7.00-8.00   sec  5.29 GBytes  45.5 Gbits/sec  436   1.10 MBytes
>> [  5]   8.00-9.00   sec  5.27 GBytes  45.2 Gbits/sec  464   1.30 MBytes
>> [  5]   9.00-10.00  sec  5.25 GBytes  45.1 Gbits/sec  425   1.13 MBytes
>> [  5]  10.00-11.00  sec  5.29 GBytes  45.4 Gbits/sec  268   1.19 MBytes
>> [  5]  11.00-12.00  sec  4.98 GBytes  42.8 Gbits/sec  711    793 KBytes
>> [  5]  12.00-13.00  sec  3.80 GBytes  32.6 Gbits/sec  1255    801 KBytes
>> [  5]  13.00-14.00  sec  3.80 GBytes  32.7 Gbits/sec  1130    642 KBytes
>> [  5]  14.00-15.00  sec  4.31 GBytes  37.0 Gbits/sec  1024   1.11 MBytes
>> [  5]  15.00-16.00  sec  5.18 GBytes  44.5 Gbits/sec  359   1.25 MBytes
>> [  5]  16.00-17.00  sec  5.23 GBytes  44.9 Gbits/sec  265    900 KBytes
>> [  5]  17.00-18.00  sec  4.70 GBytes  40.4 Gbits/sec  769    715 KBytes
>> [  5]  18.00-19.00  sec  3.77 GBytes  32.4 Gbits/sec  1841    889 KBytes
>> [  5]  19.00-20.00  sec  3.77 GBytes  32.4 Gbits/sec  1084    827 KBytes
>> [  5]  20.00-21.00  sec  5.01 GBytes  43.0 Gbits/sec  558    994 KBytes
>> [  5]  21.00-22.00  sec  5.27 GBytes  45.3 Gbits/sec  450   1.25 MBytes
>> [  5]  22.00-23.00  sec  5.25 GBytes  45.1 Gbits/sec  338   1.18 MBytes
>> [  5]  23.00-24.00  sec  5.29 GBytes  45.4 Gbits/sec  200   1.14 MBytes
>> [  5]  24.00-25.00  sec  5.29 GBytes  45.5 Gbits/sec  518   1.02 MBytes
>> [  5]  25.00-26.00  sec  4.28 GBytes  36.7 Gbits/sec  1258    792 KBytes
>> [  5]  26.00-27.00  sec  3.87 GBytes  33.2 Gbits/sec  1365    799 KBytes
>> [  5]  27.00-28.00  sec  4.77 GBytes  41.0 Gbits/sec  530   1.09 MBytes
>> [  5]  28.00-29.00  sec  5.31 GBytes  45.6 Gbits/sec  419   1.06 MBytes
>> [  5]  29.00-30.00  sec  5.32 GBytes  45.7 Gbits/sec  222   1.10 MBytes
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Retr
>> [  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec  20301             sender
>> [  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec                  receiver
>>
>> iperf Done.
>>
>>
>> I am still working through the technicalities of this patch, but I wanted to share these initial findings for your input. Please let me know your thoughts on this!
> 
> Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
> 
> Thanks !

Thanks Eric, it works fine after above fix.

Regards,
Aditya


* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
                     ` (3 preceding siblings ...)
  2025-11-11 16:56   ` Aditya Garg
@ 2025-11-12 14:03   ` Jon Hunter
  2025-11-12 14:08     ` Eric Dumazet
  4 siblings, 1 reply; 30+ messages in thread
From: Jon Hunter @ 2025-11-12 14:03 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, linux-tegra@vger.kernel.org

Hi Eric,

On 06/11/2025 20:29, Eric Dumazet wrote:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
> 
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
> 
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
> 
> This removes contention on SLUB spinlocks and data structures.
> 
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
> 
> 80 Mpps -> 120 Mpps.
> 
> Profiling one of the 32 cpus servicing NIC interrupts :
> 
> Before:
> 
> mpstat -P 511 1 1
> 
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
> 
>      31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>       5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
>       3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>       3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>       2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
>       2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
>       2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
>       2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
>       2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
>       2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
>       2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
>       2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>       1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
>       1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
>       1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>       1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>       1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
>       1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>       1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>       0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
>       0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
>       0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
>       0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>       0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
>       0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>       0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
>       0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
>       0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
>       0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
>       0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
>       0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
> 
> After:
> 
> mpstat -P 511 1 1
> 
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
> 
>      19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>      13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>      10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
>      10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>       7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>       6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
>       5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
>       3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>       3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
>       2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
>       2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>       1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>       1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
>       0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
>       0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
>       0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
>       0.53%  swapper          [kernel.kallsyms]  [k] io_idle
>       0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
>       0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
>       0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
>       0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
>       0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
>       0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
>       0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
>       0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
>       0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
>       0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
>       0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
>       0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
>       0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
>       0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>   net/core/skbuff.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>   
>   	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>   
> +	if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> +		skb_release_head_state(skb);
> +		return skb_attempt_defer_free(skb);
> +	}
> +
>   	if (!skb_unref(skb))
>   		return;
>   

I have noticed a suspend regression on one of our Tegra boards. Bisect
is pointing to this commit and reverting this on top of -next fixes the
issue.

Out of all the Tegra boards we test only one is failing and that is the
tegra124-jetson-tk1. This board uses the realtek r8169 driver ...

  r8169 0000:01:00.0: enabling device (0140 -> 0143)
  r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
  r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]

I don't see any particular crash or error, and even after resuming from
suspend the link does come up ...

  r8169 0000:01:00.0 enp1s0: Link is Down
  tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
  OOM killer enabled.
  Restarting tasks: Starting
  Restarting tasks: Done
  random: crng reseeded on system resumption
  PM: suspend exit
  ata1: SATA link down (SStatus 0 SControl 300)
  r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx

However, the board does not seem to resume fully. One thing I should
point out is that for testing we always use an NFS rootfs. So this
would indicate that the link comes up but networking is still having
issues.

Any thoughts?

Jon
  
-- 
nvpublic



* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-12 14:03   ` Jon Hunter
@ 2025-11-12 14:08     ` Eric Dumazet
  2025-11-12 15:26       ` Jon Hunter
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2025-11-12 14:08 UTC (permalink / raw)
  To: Jon Hunter
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	linux-tegra@vger.kernel.org

On Wed, Nov 12, 2025 at 6:03 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Eric,
>
> On 06/11/2025 20:29, Eric Dumazet wrote:
> > There is a lack of NUMA awareness and more generally lack
> > of slab caches affinity on TX completion path.
> >
> > Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> > in per-cpu caches so that they can be recycled in RX path.
> >
> > Only use this if the skb was allocated on the same cpu,
> > otherwise use skb_attempt_defer_free() so that the skb
> > is freed on the original cpu.
> >
> > This removes contention on SLUB spinlocks and data structures.
> >
> > After this patch, I get ~50% improvement for an UDP tx workload
> > on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
> >
> > 80 Mpps -> 120 Mpps.
> >
> > Profiling one of the 32 cpus servicing NIC interrupts :
> >
> > Before:
> >
> > mpstat -P 511 1 1
> >
> > Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> > Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
> >
> >      31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> >      12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> >       5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
> >       3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
> >       3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
> >       2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
> >       2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
> >       2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
> >       2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
> >       2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
> >       2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
> >       2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
> >       2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
> >       1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
> >       1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
> >       1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
> >       1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
> >       1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
> >       1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
> >       1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
> >       0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
> >       0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
> >       0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
> >       0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
> >       0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
> >       0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
> >       0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
> >       0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
> >       0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
> >       0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
> >       0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
> >       0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
> >
> > After:
> >
> > mpstat -P 511 1 1
> >
> > Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> > Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
> >
> >      19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
> >      13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
> >      10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
> >      10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
> >       7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> >       6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
> >       5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
> >       3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
> >       3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
> >       2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
> >       2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
> >       1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
> >       1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
> >       0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
> >       0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
> >       0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
> >       0.53%  swapper          [kernel.kallsyms]  [k] io_idle
> >       0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
> >       0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
> >       0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
> >       0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
> >       0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
> >       0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
> >       0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
> >       0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
> >       0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
> >       0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
> >       0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
> >       0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
> >       0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
> >       0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >   net/core/skbuff.c | 5 +++++
> >   1 file changed, 5 insertions(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
> >
> >       DEBUG_NET_WARN_ON_ONCE(!in_softirq());
> >
> > +     if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> > +             skb_release_head_state(skb);
> > +             return skb_attempt_defer_free(skb);
> > +     }
> > +
> >       if (!skb_unref(skb))
> >               return;
> >
>
> I have noticed a suspend regression on one of our Tegra boards. Bisect
> is pointing to this commit and reverting this on top of -next fixes the
> issue.
>
> Out of all the Tegra boards we test only one is failing and that is the
> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
>
>   r8169 0000:01:00.0: enabling device (0140 -> 0143)
>   r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
>   r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
>
> I don't see any particular crash or error, and even after resuming from
> suspend the link does come up ...
>
>   r8169 0000:01:00.0 enp1s0: Link is Down
>   tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
>   OOM killer enabled.
>   Restarting tasks: Starting
>   Restarting tasks: Done
>   random: crng reseeded on system resumption
>   PM: suspend exit
>   ata1: SATA link down (SStatus 0 SControl 300)
>   r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
>
> However, the board does not seem to resume fully. One thing I should
> point out is that for testing we always use an NFS rootfs. So this
> would indicate that the link comes up but networking is still having
> issues.
>
> Any thoughts?
>
> Jon

Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/

Thanks !


* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-12 14:08     ` Eric Dumazet
@ 2025-11-12 15:26       ` Jon Hunter
  2025-11-12 15:32         ` Eric Dumazet
  0 siblings, 1 reply; 30+ messages in thread
From: Jon Hunter @ 2025-11-12 15:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	linux-tegra@vger.kernel.org


On 12/11/2025 14:08, Eric Dumazet wrote:

...

>> I have noticed a suspend regression on one of our Tegra boards. Bisect
>> is pointing to this commit and reverting this on top of -next fixes the
>> issue.
>>
>> Out of all the Tegra boards we test only one is failing and that is the
>> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
>>
>>    r8169 0000:01:00.0: enabling device (0140 -> 0143)
>>    r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
>>    r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
>>
>> I don't see any particular crash or error, and even after resuming from
>> suspend the link does come up ...
>>
>>    r8169 0000:01:00.0 enp1s0: Link is Down
>>    tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
>>    OOM killer enabled.
>>    Restarting tasks: Starting
>>    Restarting tasks: Done
>>    random: crng reseeded on system resumption
>>    PM: suspend exit
>>    ata1: SATA link down (SStatus 0 SControl 300)
>>    r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
>>
>> However, the board does not seem to resume fully. One thing I should
>> point out is that for testing we always use an NFS rootfs. So this
>> would indicate that the link comes up but networking is still having
>> issues.
>>
>> Any thoughts?
>>
>> Jon
> 
> Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/


That does indeed fix it. Feel free to add my ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks
Jon

-- 
nvpublic



* Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
  2025-11-12 15:26       ` Jon Hunter
@ 2025-11-12 15:32         ` Eric Dumazet
  0 siblings, 0 replies; 30+ messages in thread
From: Eric Dumazet @ 2025-11-12 15:32 UTC (permalink / raw)
  To: Jon Hunter
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	linux-tegra@vger.kernel.org

On Wed, Nov 12, 2025 at 7:27 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 12/11/2025 14:08, Eric Dumazet wrote:
>
> ...
>
> >> I have noticed a suspend regression on one of our Tegra boards. Bisect
> >> is pointing to this commit and reverting this on top of -next fixes the
> >> issue.
> >>
> >> Out of all the Tegra boards we test only one is failing and that is the
> >> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
> >>
> >>    r8169 0000:01:00.0: enabling device (0140 -> 0143)
> >>    r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
> >>    r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
> >>
> >> I don't see any particular crash or error, and even after resuming from
> >> suspend the link does come up ...
> >>
> >>    r8169 0000:01:00.0 enp1s0: Link is Down
> >>    tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
> >>    OOM killer enabled.
> >>    Restarting tasks: Starting
> >>    Restarting tasks: Done
> >>    random: crng reseeded on system resumption
> >>    PM: suspend exit
> >>    ata1: SATA link down (SStatus 0 SControl 300)
> >>    r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
> >>
> >> However, the board does not seem to resume fully. One thing I should
> >> point out is that for testing we always use an NFS rootfs. So this
> >> would indicate that the link comes up but networking is still having
> >> issues.
> >>
> >> Any thoughts?
> >>
> >> Jon
> >
> > Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
>
>
> That does indeed fix it. Feel free to add my ...
>
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks for testing. Note that the patch was already merged in net-next, so
we cannot add your tag.

Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Nov 11 15:12:35 2025 +0000

    net: clear skb->sk in skb_release_head_state()

    skb_release_head_state() inlines skb_orphan().

    We need to clear skb->sk otherwise we can freeze TCP flows
    on a mostly idle host, because skb_fclone_busy() would
    return true as long as the packet is not yet processed by
    skb_defer_free_flush().

    Fixes: 1fcf572211da ("net: allow skb_release_head_state() to be called multiple times")
    Fixes: e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Tested-by: Aditya Garg <gargaditya@linux.microsoft.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
    Link: https://patch.msgid.link/20251111151235.1903659-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>


end of thread, other threads:[~2025-11-12 15:33 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
2025-11-07  6:42   ` Kuniyuki Iwashima
2025-11-07 11:23   ` Toke Høiland-Jørgensen
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
2025-11-07  6:46   ` Kuniyuki Iwashima
2025-11-07 11:23   ` Toke Høiland-Jørgensen
2025-11-07 12:28     ` Eric Dumazet
2025-11-07 14:34       ` Toke Høiland-Jørgensen
2025-11-07 15:44   ` Jason Xing
2025-11-11 16:56   ` Aditya Garg
2025-11-11 17:17     ` Eric Dumazet
2025-11-12  5:14       ` Aditya Garg
2025-11-12 14:03   ` Jon Hunter
2025-11-12 14:08     ` Eric Dumazet
2025-11-12 15:26       ` Jon Hunter
2025-11-12 15:32         ` Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
2025-11-07  6:47   ` Kuniyuki Iwashima
2025-11-07 11:23   ` Toke Høiland-Jørgensen
2025-11-07 15:37   ` Jason Xing
2025-11-07 15:47     ` Eric Dumazet
2025-11-07 15:49       ` Jason Xing
2025-11-07 16:00         ` Eric Dumazet
2025-11-07 16:03           ` Jason Xing
2025-11-07 16:08             ` Eric Dumazet
2025-11-07 16:20               ` Jason Xing
2025-11-07 16:26                 ` Eric Dumazet
2025-11-07 16:59                   ` Jason Xing
2025-11-08  3:10 ` [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).