From: Eric Dumazet
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
Date: Mon, 17 Nov 2008 12:20:59 +0100
Message-ID: <4921539B.2000002@cosmosbay.com>
References: <1ScKicKnTUE.A.VxH.DIHIJB@chimera> <20081117090648.GG28786@elte.hu> <20081117.011403.06989342.davem@davemloft.net> <20081117110119.GL28786@elte.hu>
In-Reply-To: <20081117110119.GL28786@elte.hu>
Sender: linux-kernel-owner@vger.kernel.org
To: Ingo Molnar
Cc: David Miller, rjw@sisk.pl, linux-kernel@vger.kernel.org, kernel-testers@vger.kernel.org, cl@linux-foundation.org, efault@gmx.de, a.p.zijlstra@chello.nl, Linus Torvalds

Ingo Molnar wrote:
> * David Miller wrote:
>
>> From: Ingo Molnar
>> Date: Mon, 17 Nov 2008 10:06:48 +0100
>>
>>> * Rafael J. Wysocki wrote:
>>>
>>>> This message has been generated automatically as a part of a report
>>>> of regressions introduced between 2.6.26 and 2.6.27.
>>>>
>>>> The following bug entry is on the current list of known regressions
>>>> introduced between 2.6.26 and 2.6.27.  Please verify if it still should
>>>> be listed and let me know (either way).
>>>>
>>>>
>>>> Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=11308
>>>> Subject     : tbench regression on each kernel release from 2.6.22 -> 2.6.28
>>>> Submitter   : Christoph Lameter
>>>> Date        : 2008-08-11 18:36 (98 days old)
>>>> References  : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
>>>>               http://marc.info/?l=linux-kernel&m=122125737421332&w=4
>>> Christoph, as per the recent analysis of Mike:
>>>
>>>  http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
>>>
>>> all scheduler components of this regression have been eliminated.
>>>
>>> In fact his numbers show that scheduler speedups since 2.6.22 have
>>> offset and hidden most other sources of tbench regression. (i.e. the
>>> scheduler portion got 5% faster, hence it was able to offset a
>>> slowdown of 5% in other areas of the kernel that tbench triggers)
>> Although I respect the improvements, wake_up() is still several
>> orders of magnitude slower than it was in 2.6.22 and wake_up() is at
>> the top of the profiles in tbench runs.
>
> hm, several orders of magnitude slower? That contradicts Mike's
> numbers and my own numbers and profiles as well: see below.
>
> The scheduler's overhead barely even registers on a 16-way x86 system
> i'm running tbench on. Here's the NMI profile during 64 threads tbench
> on a 16-way x86 box with an v2.6.28-rc5 kernel [config attached]:
>
>   Throughput 3437.65 MB/sec 64 procs
>   ==================================
>   21570252 total
>   ........
>    1494803 copy_user_generic_string
>     998232 sock_rfree
>     491471 tcp_ack
>     482405 ip_dont_fragment
>     470685 ip_local_deliver
>     436325 constant_test_bit      [ called by napi_disable_pending() ]
>     375469 avc_has_perm_noaudit
>     347663 tcp_sendmsg
>     310383 tcp_recvmsg
>     300412 __inet_lookup_established
>     294377 system_call
>     286603 tcp_transmit_skb
>     251782 selinux_ip_postroute
>     236028 tcp_current_mss
>     235631 schedule
>     234013 netif_rx
>     229854 _local_bh_enable_ip
>     219501 tcp_v4_rcv
>
>   [ etc. - see full profile attached further below ]
>
> Note that the scheduler does not even show up in the profile up to
> entry #15!
>
> I've also summarized NMI profiler output by major subsystems:
>
>            NET overhead (12603450/21570252): 58.43%
>       security overhead ( 1903598/21570252):  8.83%
>       usercopy overhead ( 1753617/21570252):  8.13%
>          sched overhead ( 1599406/21570252):  7.41%
>        syscall overhead (  560487/21570252):  2.60%
>            IRQ overhead (  555439/21570252):  2.58%
>           slab overhead (  492421/21570252):  2.28%
>          timer overhead (  226573/21570252):  1.05%
>      pagealloc overhead (  192681/21570252):  0.89%
>            PID overhead (  115123/21570252):  0.53%
>            VFS overhead (  107926/21570252):  0.50%
>      pagecache overhead (   62552/21570252):  0.29%
>           gtod overhead (   38651/21570252):  0.18%
>           IDLE overhead (       0/21570252):  0.00%
>   ---------------------------------------------------------
>                    left ( 1349494/21570252):  6.26%
>
> The scheduler's functions are absolutely flat, and consistent with an
> extreme context-switching rate of 1.35 million per second. The
> scheduler can go up to about 20 million context switches per second on
> this system:
>
>  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>   r  b   swpd     free   buff  cache   si   so    bi    bo     in       cs us sy id wa st
>  32  0      0 32229696  29308 649880    0    0     0     0 164135 20026853 24 76  0  0  0
>  32  0      0 32229752  29308 649880    0    0     0     0 164203 20032770 24 76  0  0  0
>  32  0      0 32229752  29308 649880    0    0     0     0 164201 20036492 25 75  0  0  0
>
> ... and 7% scheduling overhead is roughly consistent with 1.35/20.0.
>
> Wake up affinities and data flow caching is just fine in this workload
> - we've got scheduler statistics for that and they look good too.
>
> It all looks like pure old-fashioned straight overhead in the
> networking layer to me. Do we still touch the same global cacheline
> for every localhost packet we process? Anything like that would show
> up big time.

Yes, we do. I find it strange that we don't see dst_release() in your NMI profile.

I posted a patch (commit 5635c10d976716ef47ae441998aeae144c7e7387,
"net: make sure struct dst_entry refcount is aligned on 64 bytes",
in the net-next-2.6 tree) to properly align the struct dst_entry
refcounter, and got a 4% speedup on tbench on my machine.

There are small speedups too with commit ef711cf1d156428d4c2911b8c86c6ce90519dc45
("net: speedup dst_release()").

Also in net-next-2.6, there are patches that avoid dirtying last_rx on
netdevices (loopback, for example); that helps tbench a lot too.
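
For readers unfamiliar with the technique named in the first commit title, here is a minimal userspace sketch of placing a frequently written reference counter on a 64-byte boundary so that its updates do not dirty the cache line holding read-mostly fields. It is an illustration only: the actual diff is not shown in this message, struct demo_entry and its field names are hypothetical and are not the kernel's struct dst_entry, and the 64-byte line size is taken from the commit title as an assumption about the target CPU.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, the figure named in the commit title. */
#define CACHELINE_BYTES 64

/*
 * Hypothetical stand-in for a routing-cache entry. This is NOT the
 * kernel's struct dst_entry; the field names are made up.
 */
struct demo_entry {
	/* Read-mostly fields consulted for every packet. */
	void *dev;
	unsigned long expires;
	int flags;

	/*
	 * Written on every hold/release. alignas() makes it start a fresh
	 * cache line, so those writes do not invalidate the line that
	 * holds the read-mostly fields above.
	 */
	alignas(CACHELINE_BYTES) atomic_int refcnt;
	unsigned long lastuse;
};

/* Compile-time check that the counter really sits on a line boundary. */
static_assert(offsetof(struct demo_entry, refcnt) % CACHELINE_BYTES == 0,
	      "refcnt must start a cache line");

/* Byte offset of the counter inside the struct. */
static inline size_t demo_refcnt_offset(void)
{
	return offsetof(struct demo_entry, refcnt);
}

/* Take a reference; returns the new count. */
static inline int demo_hold(struct demo_entry *e)
{
	return atomic_fetch_add(&e->refcnt, 1) + 1;
}

/* Drop a reference; returns the new count. */
static inline int demo_release(struct demo_entry *e)
{
	return atomic_fetch_sub(&e->refcnt, 1) - 1;
}
```

The member alignment also raises the alignment of the whole struct to 64 bytes, so static and stack instances start on a line boundary; heap instances would need an aligned allocator such as aligned_alloc() for the offset guarantee to carry over to real addresses.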