From: Jon Hunter <jonathanh@nvidia.com>
To: Eric Dumazet <edumazet@google.com>,
"David S . Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>,
Kuniyuki Iwashima <kuniyu@google.com>,
Willem de Bruijn <willemb@google.com>,
netdev@vger.kernel.org, eric.dumazet@gmail.com,
"linux-tegra@vger.kernel.org" <linux-tegra@vger.kernel.org>
Subject: Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
Date: Wed, 12 Nov 2025 14:03:30 +0000
Message-ID: <7460a188-3a74-4336-ae03-c88e21ffc1ca@nvidia.com>
In-Reply-To: <20251106202935.1776179-3-edumazet@google.com>
Hi Eric,
On 06/11/2025 20:29, Eric Dumazet wrote:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts :
>
> Before:
>
> mpstat -P 511 1 1
>
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
>
> 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> 2.11% swapper [kernel.kallsyms] [k] __slab_free
> 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> 0.93% swapper [kernel.kallsyms] [k] read_tsc
> 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> 0.48% swapper [kernel.kallsyms] [k] sock_wfree
>
> After:
>
> mpstat -P 511 1 1
>
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
>
> 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> 2.73% swapper [kernel.kallsyms] [k] read_tsc
> 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> 0.53% swapper [kernel.kallsyms] [k] io_idle
> 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> 0.32% swapper [kernel.kallsyms] [k] dql_completed
> 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> 0.28% swapper [kernel.kallsyms] [k] ktime_get
> 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/skbuff.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>
>  	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>
> +	if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> +		skb_release_head_state(skb);
> +		return skb_attempt_defer_free(skb);
> +	}
> +
>  	if (!skb_unref(skb))
>  		return;
>
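
For readers following the thread, the check added by the hunk above can be modelled in plain userspace C. This is only a sketch under stated assumptions, not kernel code: `struct model_skb`, `enum free_path`, `model_skb_shared()` and `model_pick_free_path()` are hypothetical names invented for illustration, and the reference count is a plain int rather than a refcount_t. It shows only which path the new condition selects.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two sk_buff fields the new check reads. */
struct model_skb {
	int alloc_cpu;	/* CPU that allocated the buffer */
	int users;	/* reference count; anything other than 1 means shared */
};

/* Hypothetical labels for the two outcomes of the new check. */
enum free_path {
	FREE_PATH_LOCAL,		/* fall through to the pre-existing path on this CPU */
	FREE_PATH_DEFER_TO_OWNER	/* hand the buffer back to the allocating CPU */
};

/* Mirrors the idea of skb_shared(): more than one holder of the buffer. */
static bool model_skb_shared(const struct model_skb *skb)
{
	return skb->users != 1;
}

/*
 * Mirrors the condition in the hunk: an unshared buffer that was allocated
 * on another CPU is deferred to that CPU; everything else stays local.
 */
static enum free_path model_pick_free_path(const struct model_skb *skb,
					   int this_cpu)
{
	if (skb->alloc_cpu != this_cpu && !model_skb_shared(skb))
		return FREE_PATH_DEFER_TO_OWNER;
	return FREE_PATH_LOCAL;
}
```

In this model, a buffer with alloc_cpu 3 and a single user is deferred when completed on CPU 5 and stays local when completed on CPU 3. A shared buffer always stays on the local path, where the real function would then only drop a reference.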

I have noticed a suspend regression on one of our Tegra boards. Bisect
is pointing to this commit, and reverting it on top of -next fixes the
issue.

Out of all the Tegra boards we test, only one is failing: the
tegra124-jetson-tk1. This board uses the Realtek r8169 driver ...

 r8169 0000:01:00.0: enabling device (0140 -> 0143)
 r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
 r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]

I don't see any particular crash or error, and even after resuming from
suspend the link does come up ...

 r8169 0000:01:00.0 enp1s0: Link is Down
 tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
 OOM killer enabled.
 Restarting tasks: Starting
 Restarting tasks: Done
 random: crng reseeded on system resumption
 PM: suspend exit
 ata1: SATA link down (SStatus 0 SControl 300)
 r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx

However, the board does not seem to resume fully. One thing I should
point out is that for testing we always use an NFS rootfs, so this
would indicate that the link comes up but networking is still having
issues.

Any thoughts?

Jon

--
nvpublic