From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ben Greear Subject: Re: Performance regression between 4.13 and 4.14 Date: Wed, 9 May 2018 12:02:34 -0700 Message-ID: References: <9115910b-dd8b-e7f3-be53-f739b8382032@candelatech.com> <17a364e0-89c8-d4f6-3873-353c7dae4fba@candelatech.com> <988a00db-bab3-7318-5e31-2a7c0948262a@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit To: Eric Dumazet , netdev Return-path: Received: from mail2.candelatech.com ([208.74.158.173]:37156 "EHLO mail2.candelatech.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964961AbeEITCh (ORCPT ); Wed, 9 May 2018 15:02:37 -0400 In-Reply-To: <988a00db-bab3-7318-5e31-2a7c0948262a@gmail.com> Sender: netdev-owner@vger.kernel.org List-ID: On 05/09/2018 11:48 AM, Eric Dumazet wrote: > > > On 05/09/2018 11:43 AM, Ben Greear wrote: >> On 05/08/2018 10:10 AM, Eric Dumazet wrote: >>> >>> >>> On 05/08/2018 09:44 AM, Ben Greear wrote: >>>> Hello, >>>> >>>> I am trying to track down a performance regression that appears to be between 4.13 >>>> and 4.14. >>>> >>>> I first saw the problem with a hacked version of pktgen on some ixgbe NICs. 4.13 can do >>>> right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps. >>>> >>>> I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx >>>> on 4.14 and about 4.4Gbps on 4.13. >>>> >>>> Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits >>>> crash or do not bring up networking, which makes the bisect difficult. >>>> >>>> Looking at perf top, it would appear that some lock is probably to blame. >>> >>> >>> perf record -a -g -e cycles:pp sleep 5 >>> perf report >>> >>> Then you'll be able to tell us which lock (or call graph) is killing your perf. >>> >> >> I seem to be chasing multiple issues. For 4.13, at least part of my problem was that LOCKDEP was enabled, >> during my bisect, though it does NOT appear enabled in 4.16. I think maybe CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING >> in 4.16, or something like that? My 4.16 .config does have CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it: >> >> [greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config >> CONFIG_LOCKDEP_SUPPORT=y >> >> >> For 4.16, I am disabling RETRAMPOLINE...are there any other such things I need >> to disable to keep from getting a performance hit from the spectre-related bug >> fixes? At this point, I do not care about the security implications. >> >> greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config >> # CONFIG_RETPOLINE is not set >> >> >> Thanks, >> Ben >> > > No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/ I initially saw the problem in 4.16, then bisected, and 4.14 still showed the issue. 4.13 works, but only when I use a .config I originally built for 4.13, not the 4.16 .config that I ended up using with the bisect (make oldconfig, accept all defaults). I originally configured 4.16 with a .config that had lockdep enabled, then manually tried to disable it through 'make xconfig'. I think that must leave "CONFIG_LOCKDEP=y" in the .config, which screws up older builds during bisect, perhaps? > Before doing a (painful) dissection, the perf output would immediately tell you if > something is really wrong on your .config. I didn't realize lockdep might be an issue at the time, but here is a 'bad' run from a 4.13+ (plus pktgen hacks). I guess lockdep is why this runs slowly, but I see no obvious proof of that in the output: 4.13+, patched pktgen, 6Gbps throughput, on commit 906dde0f355bd97c080c215811ae7db1137c4af8 Samples: 26K of event 'cycles:pp', Event count (approx.): 20119166736 Children Self Command Shared Object Symbol + 87.97% 0.00% kpktgend_1 [kernel.kallsyms] [k] ret_from_fork + 87.97% 0.00% kpktgend_1 [kernel.kallsyms] [k] kthread + 86.89% 5.42% kpktgend_1 [kernel.kallsyms] [k] pktgen_thread_worker + 33.75% 0.18% kpktgend_1 [kernel.kallsyms] [k] getnstimeofday64 + 32.77% 4.47% kpktgend_1 [kernel.kallsyms] [k] __getnstimeofday64 + 24.60% 10.91% kpktgend_1 [kernel.kallsyms] [k] lock_acquire + 23.59% 0.03% kpktgend_1 [kernel.kallsyms] [k] __do_softirq + 23.55% 0.07% kpktgend_1 [kernel.kallsyms] [k] net_rx_action + 22.29% 0.47% kpktgend_1 [kernel.kallsyms] [k] getRelativeCurNs + 21.33% 1.71% kpktgend_1 [kernel.kallsyms] [k] ixgbe_poll + 15.79% 0.02% kpktgend_1 [kernel.kallsyms] [k] ret_from_intr + 15.78% 0.01% kpktgend_1 [kernel.kallsyms] [k] do_IRQ + 15.34% 0.01% kpktgend_1 [kernel.kallsyms] [k] irq_exit + 13.95% 10.00% kpktgend_1 [kernel.kallsyms] [k] ip_send_check + 13.80% 13.80% kpktgend_1 [kernel.kallsyms] [k] __lock_acquire.isra.31 + 12.98% 0.53% kpktgend_1 [kernel.kallsyms] [k] pktgen_finalize_skb + 12.31% 0.20% kpktgend_1 [kernel.kallsyms] [k] timestamp_skb.isra.24 + 11.68% 0.13% kpktgend_1 [kernel.kallsyms] [k] napi_gro_receive + 11.36% 0.25% kpktgend_1 [kernel.kallsyms] [k] netif_receive_skb_internal + 10.93% 0.00% swapper [kernel.kallsyms] [k] verify_cpu + 10.93% 0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry + 10.92% 0.02% swapper [kernel.kallsyms] [k] do_idle + 10.71% 0.00% swapper [kernel.kallsyms] [k] cpuidle_enter + 10.71% 0.00% swapper [kernel.kallsyms] [k] call_cpuidle + 10.66% 0.06% swapper [kernel.kallsyms] [k] cpuidle_enter_state + 9.78% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_kernel + 9.78% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_reservations + 9.78% 0.00% swapper [kernel.kallsyms] [k] start_kernel + 9.78% 0.00% swapper [kernel.kallsyms] [k] rest_init + 9.25% 0.00% swapper [kernel.kallsyms] [k] ret_from_intr + 9.25% 0.00% swapper [kernel.kallsyms] [k] do_IRQ + 9.24% 0.30% kpktgend_1 [kernel.kallsyms] [k] pktgen_alloc_skb + 9.14% 0.00% swapper [kernel.kallsyms] [k] irq_exit + 8.89% 0.00% swapper [kernel.kallsyms] [k] __do_softirq + 8.82% 0.01% swapper [kernel.kallsyms] [k] net_rx_action + 8.80% 0.47% kpktgend_1 [kernel.kallsyms] [k] __local_bh_enable_ip + 8.67% 6.87% kpktgend_1 [kernel.kallsyms] [k] lock_release + 8.57% 1.10% kpktgend_1 [kernel.kallsyms] [k] __netdev_alloc_skb + 8.28% 0.00% kpktgend_1 [kernel.kallsyms] [k] do_softirq.part.17 + 8.28% 0.00% kpktgend_1 [kernel.kallsyms] [k] do_softirq_own_stack + 8.01% 0.74% swapper [kernel.kallsyms] [k] ixgbe_poll + 6.85% 2.28% kpktgend_1 [kernel.kallsyms] [k] __build_skb + 4.89% 0.07% kpktgend_1 [kernel.kallsyms] [k] __netif_receive_skb + 4.87% 0.75% kpktgend_1 [kernel.kallsyms] [k] __netif_receive_skb_core + 4.66% 0.16% kpktgend_1 [kernel.kallsyms] [k] _raw_spin_lock + 4.37% 0.06% swapper [kernel.kallsyms] [k] napi_gro_receive + 4.23% 0.10% swapper [kernel.kallsyms] [k] netif_receive_skb_internal Here is a 'bad' run of 4.16.0 + my full out-of-tree patches, including my normal pktgen changes (pktgen code should be nearly identical between the run above and the run here) Samples: 26K of event 'cycles:pp', Event count (approx.): 19888249841 Children Self Command Shared Object Symbol + 74.24% 5.38% kpktgend_1 [kernel.kallsyms] [k] pktgen_thread_worker + 73.89% 0.00% kpktgend_1 [kernel.kallsyms] [k] ret_from_fork + 73.89% 0.00% kpktgend_1 [kernel.kallsyms] [k] kthread + 27.31% 0.25% kpktgend_1 [kernel.kallsyms] [k] getnstimeofday64 + 26.15% 4.24% kpktgend_1 [kernel.kallsyms] [k] __getnstimeofday64 + 22.64% 0.02% kpktgend_1 [kernel.kallsyms] [k] __softirqentry_text_s + 22.62% 0.06% kpktgend_1 [kernel.kallsyms] [k] net_rx_action + 22.59% 2.04% kpktgend_1 [kernel.kallsyms] [k] ixgbe_poll + 21.49% 7.92% kpktgend_1 [kernel.kallsyms] [k] lock_acquire + 17.78% 17.78% kpktgend_1 [kernel.kallsyms] [k] __lock_acquire.isra.3 + 15.95% 0.39% kpktgend_1 [kernel.kallsyms] [k] getRelativeCurNs + 14.49% 0.02% kpktgend_1 [kernel.kallsyms] [k] ret_from_intr + 14.48% 0.00% kpktgend_1 [kernel.kallsyms] [k] do_IRQ + 14.35% 0.01% kpktgend_1 [kernel.kallsyms] [k] irq_exit + 13.46% 9.90% kpktgend_1 [kernel.kallsyms] [k] ip_send_check + 11.36% 0.18% kpktgend_1 [kernel.kallsyms] [k] napi_gro_receive + 11.25% 0.17% kpktgend_1 [kernel.kallsyms] [k] netif_receive_skb_int + 10.37% 0.16% kpktgend_1 [kernel.kallsyms] [k] timestamp_skb.isra.25 + 8.85% 0.45% kpktgend_1 [kernel.kallsyms] [k] __local_bh_enable_ip + 8.61% 0.02% swapper [kernel.kallsyms] [k] do_idle + 8.60% 0.00% swapper [kernel.kallsyms] [k] secondary_startup_64 + 8.60% 0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry + 8.37% 0.05% swapper [kernel.kallsyms] [k] cpuidle_enter_state + 8.35% 7.57% kpktgend_1 [kernel.kallsyms] [k] lock_release + 8.32% 0.00% kpktgend_1 [kernel.kallsyms] [k] do_softirq.part.16 + 8.31% 0.00% kpktgend_1 [kernel.kallsyms] [k] do_softirq_own_stack + 7.16% 0.36% kpktgend_1 [kernel.kallsyms] [k] pktgen_alloc_skb + 7.09% 1.14% kpktgend_1 [kernel.kallsyms] [k] __netdev_alloc_skb + 6.87% 0.00% swapper [kernel.kallsyms] [k] start_kernel + 6.55% 0.00% swapper [kernel.kallsyms] [k] ret_from_intr + 6.55% 0.01% swapper [kernel.kallsyms] [k] do_IRQ + 6.51% 0.01% swapper [kernel.kallsyms] [k] irq_exit + 6.34% 0.02% swapper [kernel.kallsyms] [k] __softirqentry_text_s + 6.25% 0.01% swapper [kernel.kallsyms] [k] net_rx_action + 6.06% 0.45% swapper [kernel.kallsyms] [k] ixgbe_poll + 5.35% 2.11% kpktgend_1 [kernel.kallsyms] [k] __build_skb I tried testing with stock kernels and UDP user-space traffic, but results varied widely from run to run, so it was useless for bisecting. I have irqbalance disabled on this system in order to pin pktgen process and ixgbe IRQs to be efficient, so maybe that is why user-space UDP traffic works so unreliably. My hacked pktgen reliably shows good on my good 4.13 kernel and reliably bad on higher kernels I have compiled so far, so I decided it was the only useful way to bisect this problem. I will now try 4.14 and higher again with lockdep definitely disabled. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com