From: Jesper Dangaard Brouer
Subject: Re: [PATCH v2 net-next] net: sched: run ingress qdisc without locks
Date: Mon, 4 May 2015 13:04:05 +0200
Message-ID: <20150504130405.3ff6672e@redhat.com>
References: <1430544448-19777-1-git-send-email-ast@plumgrid.com>
 <20150503174208.5b1548ba@redhat.com>
 <5546FFCB.50903@plumgrid.com>
In-Reply-To: <5546FFCB.50903@plumgrid.com>
To: Alexei Starovoitov
Cc: "David S. Miller", John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
 netdev@vger.kernel.org, brouer@redhat.com
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Sun, 03 May 2015 22:12:43 -0700
Alexei Starovoitov wrote:

> On 5/3/15 8:42 AM, Jesper Dangaard Brouer wrote:
> >
> > I was actually expecting to see a higher performance boost.
> > improvement diff = -2.85 ns
> ...
> > The patch is removing two atomic operations, spin_{un,}lock, which I
> > have benchmarked[1] to cost approx 14ns on my system. Your system
> > likely is faster, but not that much (p.s. benchmark your own system
> > with [1])
> >
> > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
>
> have tried your tight loop spin_lock test on my box and it showed:
> time_bench: Type:spin_lock_unlock Per elem: 40 cycles(tsc) 11.070 ns
> and yet the total single cpu gain from removal of spin_lock/unlock
> in ingress path is smaller than 11ns. I think this observation is
> telling us that tight loop benchmarking is inherently flawed.
> I'm guessing that the uops that cmpxchg is broken into can execute in
> parallel with uops of other insns, so a tight loop of the same sequence
> of uops has more alu dependencies, whereas in a more normal insn flow
> these uops can mix and match better. Would be great if Intel microarch
> experts could chime in.

How do you activate the ingress code path?  I'm just doing the
following (is this enough?):

 export DEV=eth4
 tc qdisc add dev $DEV handle ffff: ingress

I re-ran the experiment, and I can also only show a 2.68 ns improvement.

This is rather strange, and I cannot explain it.  The lock clearly
shows up in perf report[1] with 12.23% in _raw_spin_lock, and in perf
report[2] it is clearly gone, yet we don't see a 12% improvement in
performance, only around 4.7%.

Before activating qdisc ingress code  : 25.3 Mpps (25398057)
Activating qdisc ingress with lock    : 16.9 Mpps (16989315)
Activating qdisc ingress without lock : 17.8 Mpps (17800496)

 (1/17800496*10^9)-(1/16989315*10^9) = -2.68 ns

The "cost" of activating the ingress qdisc is also interesting:

 (1/25398057*10^9)-(1/16989315*10^9) = -19.49 ns
 (1/25398057*10^9)-(1/17800496*10^9) = -16.81 ns
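For completeness, the tight-loop spin_lock test referenced as [1] above
is essentially doing the following (a minimal sketch only, NOT the
actual time_bench_sample.c code; module name, loop count and boilerplate
are just illustrative):

/* Sketch: tight-loop spin_lock/unlock micro-benchmark as a kernel module.
 * Per-element cost = elapsed TSC cycles / number of loop iterations.
 */
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/preempt.h>
#include <linux/timex.h>	/* get_cycles() */
#include <asm/div64.h>		/* do_div() */

static int __init lock_bench_init(void)
{
	spinlock_t lock;
	cycles_t start, stop;
	u64 cycles;
	unsigned int i, loops = 1000000;

	spin_lock_init(&lock);

	preempt_disable();	/* keep the measurement on one CPU */
	start = get_cycles();
	for (i = 0; i < loops; i++) {
		spin_lock(&lock);
		spin_unlock(&lock);
	}
	stop = get_cycles();
	preempt_enable();

	cycles = stop - start;
	do_div(cycles, loops);	/* cycles per lock+unlock pair */
	pr_info("spin_lock_unlock: %llu cycles(tsc) per elem\n", cycles);
	return 0;
}

static void __exit lock_bench_exit(void)
{
}

module_init(lock_bench_init);
module_exit(lock_bench_exit);
MODULE_LICENSE("GPL");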
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

My setup
 * Tested on top of commit 4749c3ef854
 * gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC)
 * CPU E5-2695(ES) @ 2.8GHz

[1] perf report with ingress qlock

 Samples: 2K of event 'cycles', Event count (approx.): 1762298819
   Overhead  Command     Shared Object     Symbol
 +   35.86%  kpktgend_0  [kernel.vmlinux]  [k] __netif_receive_skb_core
 +   17.81%  kpktgend_0  [kernel.vmlinux]  [k] kfree_skb
 -   12.23%  kpktgend_0  [kernel.vmlinux]  [k] _raw_spin_lock
    - _raw_spin_lock
       + 93.54% __netif_receive_skb_core
       +  6.46% __netif_receive_skb
 +    5.45%  kpktgend_0  [sch_ingress]     [k] ingress_enqueue
 +    4.65%  kpktgend_0  [pktgen]          [k] pktgen_thread_worker
 +    4.23%  kpktgend_0  [kernel.vmlinux]  [k] ip_rcv
 +    3.95%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify_compat
 +    3.71%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify
 +    3.03%  kpktgend_0  [kernel.vmlinux]  [k] netif_receive_skb_internal
 +    2.65%  kpktgend_0  [kernel.vmlinux]  [k] netif_receive_skb_sk
 +    1.97%  kpktgend_0  [kernel.vmlinux]  [k] __netif_receive_skb
 +    0.71%  kpktgend_0  [kernel.vmlinux]  [k] __local_bh_enable_ip
 +    0.28%  kpktgend_0  [kernel.vmlinux]  [k] kthread_should_stop

[2] perf report without ingress qlock

 Samples: 2K of event 'cycles', Event count (approx.): 1633499063
   Overhead  Command     Shared Object     Symbol
 +   39.29%  kpktgend_0  [kernel.vmlinux]  [k] __netif_receive_skb_core
 +   19.24%  kpktgend_0  [kernel.vmlinux]  [k] kfree_skb
 +   11.05%  kpktgend_0  [sch_ingress]     [k] ingress_enqueue
 +    4.69%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify
 +    4.48%  kpktgend_0  [kernel.vmlinux]  [k] ip_rcv
 +    4.43%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify_compat
 +    4.19%  kpktgend_0  [pktgen]          [k] pktgen_thread_worker
 +    3.50%  kpktgend_0  [kernel.vmlinux]  [k] netif_receive_skb_internal
 +    2.61%  kpktgend_0  [kernel.vmlinux]  [k] netif_receive_skb_sk
 +    2.26%  kpktgend_0  [kernel.vmlinux]  [k] __netif_receive_skb
 +    0.43%  kpktgend_0  [kernel.vmlinux]  [k] __local_bh_enable_ip
 +    0.13%  swapper     [kernel.vmlinux]  [k] mwait_idle