From: Aditya Garg <gargaditya@linux.microsoft.com>
To: Eric Dumazet <edumazet@google.com>
Cc: "David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	Kuniyuki Iwashima <kuniyu@google.com>,
	Willem de Bruijn <willemb@google.com>,
	netdev@vger.kernel.org, eric.dumazet@gmail.com,
	ssengar@linux.microsoft.com, gargaditya@microsoft.com
Subject: Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
Date: Wed, 12 Nov 2025 10:44:28 +0530
Message-ID: <c6e3a182-b326-4a6d-901d-c445d95643eb@linux.microsoft.com>
In-Reply-To: <CANn89i+e2XDw4b_iHaj_LPTeg6M2+-+vroECks21Fs7Hfg2Aow@mail.gmail.com>

On 11-11-2025 22:47, Eric Dumazet wrote:
> On Tue, Nov 11, 2025 at 8:56 AM Aditya Garg
> <gargaditya@linux.microsoft.com> wrote:
>>
>> On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
>>> There is a lack of NUMA awareness and more generally lack
>>> of slab caches affinity on TX completion path.
>>>
>>> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
>>> in per-cpu caches so that they can be recycled in RX path.
>>>
>>> Only use this if the skb was allocated on the same cpu,
>>> otherwise use skb_attempt_defer_free() so that the skb
>>> is freed on the original cpu.
>>>
>>> This removes contention on SLUB spinlocks and data structures.
>>>
>>> After this patch, I get ~50% improvement for an UDP tx workload
>>> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>>>
>>> 80 Mpps -> 120 Mpps.
>>>
>>> Profiling one of the 32 cpus servicing NIC interrupts :
>>>
>>> Before:
>>>
>>> mpstat -P 511 1 1
>>>
>>> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>> Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
>>>
>>>      31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>>      12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>>       5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
>>>       3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>>>       3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>>>       2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
>>>       2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
>>>       2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
>>>       2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
>>>       2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
>>>       2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
>>>       2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
>>>       2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>>>       1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
>>>       1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
>>>       1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>>>       1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>>>       1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
>>>       1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>>>       1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>>>       0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
>>>       0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
>>>       0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
>>>       0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>>>       0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
>>>       0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>>>       0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
>>>       0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
>>>       0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
>>>       0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
>>>       0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
>>>       0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
>>>
>>> After:
>>>
>>> mpstat -P 511 1 1
>>>
>>> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>> Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
>>>
>>>      19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>>>      13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>>>      10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
>>>      10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>>>       7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>>       6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
>>>       5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
>>>       3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>>>       3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
>>>       2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
>>>       2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>>>       1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>>>       1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
>>>       0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
>>>       0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
>>>       0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
>>>       0.53%  swapper          [kernel.kallsyms]  [k] io_idle
>>>       0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
>>>       0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
>>>       0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
>>>       0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
>>>       0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
>>>       0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
>>>       0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
>>>       0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
>>>       0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
>>>       0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
>>>       0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
>>>       0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
>>>       0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
>>>       0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
>>>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> ---
>>>   net/core/skbuff.c | 5 +++++
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>>>
>>>        DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>>>
>>> +     if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
>>> +             skb_release_head_state(skb);
>>> +             return skb_attempt_defer_free(skb);
>>> +     }
>>> +
>>>        if (!skb_unref(skb))
>>>                return;
>>>
>>> --
>>> 2.51.2.1041.gc1ab5b90ca-goog
>>>
>>
>> I ran these tests on the latest net-next with the MANA driver, and I am observing a regression.
>>
>> lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
>> Connecting to host 10.0.0.4, port 5201
>> [  5] local 10.0.0.5 port 48692 connected to 10.0.0.4 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec  4.57 GBytes  39.2 Gbits/sec  586   1.04 MBytes
>> [  5]   1.00-2.00   sec  4.74 GBytes  40.7 Gbits/sec  520   1.13 MBytes
>> [  5]   2.00-3.00   sec  5.16 GBytes  44.3 Gbits/sec  191   1.20 MBytes
>> [  5]   3.00-4.00   sec  5.13 GBytes  44.1 Gbits/sec  520   1.11 MBytes
>> [  5]   4.00-5.00   sec   678 MBytes  5.69 Gbits/sec   93   1.37 KBytes
>> [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  11.00-12.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  12.00-13.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  13.00-14.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  14.00-15.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  15.00-16.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  16.00-17.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  17.00-18.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  18.00-19.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  19.00-20.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  20.00-21.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  21.00-22.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  22.00-23.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  23.00-24.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  24.00-25.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  25.00-26.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  26.00-27.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  27.00-28.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  28.00-29.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  29.00-30.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Retr
>> [  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec  1910             sender
>> [  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec                  receiver
>>
>> iperf Done.
>>
>>
>> I tested again with this patch reverted, and the regression was gone.
>>
>> lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
>> Connecting to host 10.0.0.4, port 5201
>> [  5] local 10.0.0.5 port 58188 connected to 10.0.0.4 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec  4.95 GBytes  42.5 Gbits/sec  541   1.10 MBytes
>> [  5]   1.00-2.00   sec  4.92 GBytes  42.3 Gbits/sec  599    878 KBytes
>> [  5]   2.00-3.00   sec  4.51 GBytes  38.7 Gbits/sec  438    803 KBytes
>> [  5]   3.00-4.00   sec  4.69 GBytes  40.3 Gbits/sec  647   1.17 MBytes
>> [  5]   4.00-5.00   sec  4.18 GBytes  35.9 Gbits/sec  1183    715 KBytes
>> [  5]   5.00-6.00   sec  5.05 GBytes  43.4 Gbits/sec  484    975 KBytes
>> [  5]   6.00-7.00   sec  5.32 GBytes  45.7 Gbits/sec  520    836 KBytes
>> [  5]   7.00-8.00   sec  5.29 GBytes  45.5 Gbits/sec  436   1.10 MBytes
>> [  5]   8.00-9.00   sec  5.27 GBytes  45.2 Gbits/sec  464   1.30 MBytes
>> [  5]   9.00-10.00  sec  5.25 GBytes  45.1 Gbits/sec  425   1.13 MBytes
>> [  5]  10.00-11.00  sec  5.29 GBytes  45.4 Gbits/sec  268   1.19 MBytes
>> [  5]  11.00-12.00  sec  4.98 GBytes  42.8 Gbits/sec  711    793 KBytes
>> [  5]  12.00-13.00  sec  3.80 GBytes  32.6 Gbits/sec  1255    801 KBytes
>> [  5]  13.00-14.00  sec  3.80 GBytes  32.7 Gbits/sec  1130    642 KBytes
>> [  5]  14.00-15.00  sec  4.31 GBytes  37.0 Gbits/sec  1024   1.11 MBytes
>> [  5]  15.00-16.00  sec  5.18 GBytes  44.5 Gbits/sec  359   1.25 MBytes
>> [  5]  16.00-17.00  sec  5.23 GBytes  44.9 Gbits/sec  265    900 KBytes
>> [  5]  17.00-18.00  sec  4.70 GBytes  40.4 Gbits/sec  769    715 KBytes
>> [  5]  18.00-19.00  sec  3.77 GBytes  32.4 Gbits/sec  1841    889 KBytes
>> [  5]  19.00-20.00  sec  3.77 GBytes  32.4 Gbits/sec  1084    827 KBytes
>> [  5]  20.00-21.00  sec  5.01 GBytes  43.0 Gbits/sec  558    994 KBytes
>> [  5]  21.00-22.00  sec  5.27 GBytes  45.3 Gbits/sec  450   1.25 MBytes
>> [  5]  22.00-23.00  sec  5.25 GBytes  45.1 Gbits/sec  338   1.18 MBytes
>> [  5]  23.00-24.00  sec  5.29 GBytes  45.4 Gbits/sec  200   1.14 MBytes
>> [  5]  24.00-25.00  sec  5.29 GBytes  45.5 Gbits/sec  518   1.02 MBytes
>> [  5]  25.00-26.00  sec  4.28 GBytes  36.7 Gbits/sec  1258    792 KBytes
>> [  5]  26.00-27.00  sec  3.87 GBytes  33.2 Gbits/sec  1365    799 KBytes
>> [  5]  27.00-28.00  sec  4.77 GBytes  41.0 Gbits/sec  530   1.09 MBytes
>> [  5]  28.00-29.00  sec  5.31 GBytes  45.6 Gbits/sec  419   1.06 MBytes
>> [  5]  29.00-30.00  sec  5.32 GBytes  45.7 Gbits/sec  222   1.10 MBytes
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Retr
>> [  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec  20301             sender
>> [  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec                  receiver
>>
>> iperf Done.
>>
>>
>> I am still working through the technical details of this patch, but wanted to share my initial findings for your input. Please let me know your thoughts.
> 
> Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
> 
> Thanks !

Thanks Eric, it works fine with the above fix applied.

Regards,
Aditya
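
For readers following the thread, the condition that patch 2/3 adds to napi_consume_skb() can be sketched in user space as below. This is only an illustration of the decision quoted in the diff (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)), not kernel code. The names toy_skb, toy_pick_free_path, FREE_LOCAL and FREE_DEFERRED are hypothetical, and the current CPU is passed in as a plain argument instead of being read with smp_processor_id().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two pieces of sk_buff state the check reads. */
struct toy_skb {
	int alloc_cpu;      /* CPU that allocated the buffer */
	unsigned int users; /* reference count; != 1 models skb_shared() */
};

/* Hypothetical labels for the two outcomes of the check. */
enum free_path {
	FREE_LOCAL,    /* fall through to the rest of napi_consume_skb() */
	FREE_DEFERRED, /* hand the skb back to alloc_cpu (skb_attempt_defer_free()) */
};

/*
 * Mirrors the condition from the quoted diff:
 *   skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)
 * this_cpu is an explicit parameter here (an assumption of this sketch).
 */
static enum free_path toy_pick_free_path(const struct toy_skb *skb, int this_cpu)
{
	assert(skb != NULL);

	bool alien = skb->alloc_cpu != this_cpu; /* allocated on another CPU */
	bool shared = skb->users != 1;           /* someone else holds a reference */

	return (alien && !shared) ? FREE_DEFERRED : FREE_LOCAL;
}
```

In this model, only an unshared skb that was allocated on a different CPU takes the deferred path. A shared skb stays on the local path even when it is alien, as does any skb completed on the CPU that allocated it.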

Thread overview: 37+ messages
2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
2025-11-07  6:42   ` Kuniyuki Iwashima
2025-11-07 11:23   ` Toke Høiland-Jørgensen
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
2025-11-07  6:46   ` Kuniyuki Iwashima
2025-11-07 11:23   ` Toke Høiland-Jørgensen
2025-11-07 12:28     ` Eric Dumazet
2025-11-07 14:34       ` Toke Høiland-Jørgensen
2025-11-07 15:44   ` Jason Xing
2025-11-11 16:56   ` Aditya Garg
2025-11-11 17:17     ` Eric Dumazet
2025-11-12  5:14       ` Aditya Garg [this message]
2025-11-12 14:03   ` Jon Hunter
2025-11-12 14:08     ` Eric Dumazet
2025-11-12 15:26       ` Jon Hunter
2025-11-12 15:32         ` Eric Dumazet
2026-03-09 21:00   ` Long delay in freeing skbs Tony Battersby
2026-03-09 21:07     ` Eric Dumazet
2026-03-09 21:18       ` Tony Battersby
2026-03-09 21:24         ` Eric Dumazet
2026-03-09 21:47           ` Tony Battersby
2026-03-10 14:49             ` Tony Battersby
2026-03-10 15:25               ` Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
2025-11-07  6:47   ` Kuniyuki Iwashima
2025-11-07 11:23   ` Toke Høiland-Jørgensen
2025-11-07 15:37   ` Jason Xing
2025-11-07 15:47     ` Eric Dumazet
2025-11-07 15:49       ` Jason Xing
2025-11-07 16:00         ` Eric Dumazet
2025-11-07 16:03           ` Jason Xing
2025-11-07 16:08             ` Eric Dumazet
2025-11-07 16:20               ` Jason Xing
2025-11-07 16:26                 ` Eric Dumazet
2025-11-07 16:59                   ` Jason Xing
2025-11-08  3:10 ` [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() patchwork-bot+netdevbpf
