From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
Subject: Re: [PATCH] poll: Avoid extra wakeups in select/poll
Date: Thu, 30 Apr 2009 14:24:08 -0700
To: Eric Dumazet
Cc: Ingo Molnar, Christoph Lameter, linux kernel, Andi Kleen,
    David Miller, jesse.brandeburg@intel.com, netdev@vger.kernel.org,
    haoki@redhat.com, mchan@broadcom.com, davidel@xmailserver.org
Message-ID: <20090430212408.GA15317@linux.vnet.ibm.com>
In-Reply-To: <49F9B0F0.40306@cosmosbay.com>

On Thu, Apr 30, 2009 at 04:08:48PM +0200, Eric Dumazet wrote:
> Ingo Molnar wrote:
> > * Eric Dumazet wrote:
> >
> >> Ingo Molnar wrote:
> >>> * Eric Dumazet wrote:
> >>>
> >>>> On uddpping, I had about 49000 wakeups per second prior to the
> >>>> patch, and about 26000 wakeups per second after the patch
> >>>> (matching the number of incoming udp messages per second).
> >>>
> >>> Very nice. It might not show up as a real performance difference if
> >>> the CPUs are not fully saturated during the test - but it could show
> >>> up as a decrease in CPU utilization.
> >>>
> >>> Also, if you run the test via 'perf stat -a ./test.sh' you should
> >>> see a reduction in instructions executed:
> >>>
> >>> aldebaran:~/linux/linux> perf stat -a sleep 1
> >>>
> >>>  Performance counter stats for 'sleep':
> >>>
> >>>    16128.045994  task clock ticks   (msecs)
> >>>           12876  context switches   (events)
> >>>             219  CPU migrations     (events)
> >>>          186144  pagefaults         (events)
> >>>     20911802763  CPU cycles         (events)
> >>>     19309416815  instructions       (events)
> >>>       199608554  cache references   (events)
> >>>        19990754  cache misses       (events)
> >>>
> >>>  Wall-clock time elapsed: 1008.882282 msecs
> >>>
> >>> With -a it's measured system-wide, from start of test to end of
> >>> test - the results will be a lot more stable (and statistically
> >>> more relevant) than wall-clock time or CPU usage measurements,
> >>> both of which are rather imprecise in general.
> >>
> >> I tried this perf stuff and got strange results on a cpu-burning
> >> bench, saturating my 8 cpus with a "while (1) ;" loop:
> >>
> >> # perf stat -a sleep 10
> >>
> >>  Performance counter stats for 'sleep':
> >>
> >>    80334.709038  task clock ticks   (msecs)
> >>           80638  context switches   (events)
> >>               4  CPU migrations     (events)
> >>             468  pagefaults         (events)
> >>    160694681969  CPU cycles         (events)
> >>    160127154810  instructions       (events)
> >>          686393  cache references   (events)
> >>          230117  cache misses       (events)
> >>
> >>  Wall-clock time elapsed: 10041.531644 msecs
> >>
> >> So that is about 16069468196 cycles per second for 8 cpus.
> >> Dividing by 8 gives 2008683524 cycles per second per cpu,
> >> which is not the 3000000000 of an E5450 @ 3.00GHz.
> >
> > What does "perf stat -l -a sleep 10" show? I suspect your counters
> > are scaled by about 67%, due to counter over-commit. -l will show
> > the scaling factor (and will scale up the results).
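
Eric's numbers do work out to almost exactly that factor: dividing the
measured cycle count by the elapsed seconds, the 8 cpus, and the E5450's
nominal 3.00 GHz gives about two thirds.  A quick check (the bc
invocation is illustrative, not a command from the thread; 10.04 is the
wall-clock time reported above):

    $ echo 'scale=4; 160694681969 / 10.04 / 8 / 3000000000' | bc
    .6668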

> The only difference I see with '-l' is that cache misses are not
> counted.
>
> (tbench 8 is running, so this is not one instruction per cycle)
>
> # perf stat -l -a sleep 10
>
>  Performance counter stats for 'sleep':
>
>    80007.128844  task clock ticks   (msecs)
>         6754642  context switches   (events)
>               2  CPU migrations     (events)
>             474  pagefaults         (events)
>    160925719143  CPU cycles         (events)
>    108482003620  instructions       (events)
>      7584035056  cache references   (events)
>                  cache misses
>
>  Wall-clock time elapsed: 10000.595448 msecs
>
> # perf stat -a sleep 10
>
>  Performance counter stats for 'sleep':
>
>    80702.908287  task clock ticks   (msecs)
>         6792588  context switches   (events)
>              24  CPU migrations     (events)
>            4867  pagefaults         (events)
>    161957342744  CPU cycles         (events)
>    109147553984  instructions       (events)
>      7633190481  cache references   (events)
>        22996234  cache misses       (events)
>
>  Wall-clock time elapsed: 10087.502391 msecs
>
> > If so then i think this behavior is confusing, and i'll make -l
> > default-enabled. (in fact i just committed this change to latest
> > -tip and pushed it out)
> >
> > To get only instructions and cycles, do:
> >
> >     perf stat -e instructions -e cycles
>
> # perf stat -e instructions -e cycles -a sleep 10
>
>  Performance counter stats for 'sleep':
>
>    109469842392  instructions   (events)
>    162012922122  CPU cycles     (events)
>
>  Wall-clock time elapsed: 10124.943544 msecs
>
> I am wondering whether the cpus are in fact running at 2 GHz ;)
>
> >> It seems strange that a "jmp myself" loop uses one unhalted cycle
> >> per instruction and 0.5 halted cycles ...
> >>
> >> Also, after using "perf stat", tbench results are 1778 MB/s
> >> instead of 2610 MB/s, even when no perf stat is running.
> >
> > Hm, that would be a bug. Could you send the dmesg output of:
> >
> >     echo p > /proc/sysrq-trigger
> >     echo p > /proc/sysrq-trigger
> >
> > With counters running it will show something like:
> >
> > [ 868.105712] SysRq : Show Regs
> > [ 868.106544]
> > [ 868.106544] CPU#1: ctrl:       ffffffffffffffff
> > [ 868.106544] CPU#1: status:     0000000000000000
> > [ 868.106544] CPU#1: overflow:   0000000000000000
> > [ 868.106544] CPU#1: fixed:      0000000000000000
> > [ 868.106544] CPU#1: used:       0000000000000000
> > [ 868.106544] CPU#1: gen-PMC0 ctrl:  00000000001300c0
> > [ 868.106544] CPU#1: gen-PMC0 count: 000000ffee889194
> > [ 868.106544] CPU#1: gen-PMC0 left:  0000000011e1791a
> > [ 868.106544] CPU#1: gen-PMC1 ctrl:  000000000013003c
> > [ 868.106544] CPU#1: gen-PMC1 count: 000000ffd2542438
> > [ 868.106544] CPU#1: gen-PMC1 left:  000000002dd17a8e
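
For reference, capturing the two snapshots a few seconds apart and
pulling them back out of the kernel log can be done like this (the grep
pattern and the 10-second gap are illustrative choices, not from the
thread):

    # take two PMU snapshots ~10 seconds apart, then extract them:
    echo p > /proc/sysrq-trigger
    sleep 10
    echo p > /proc/sysrq-trigger
    dmesg | grep -E 'SysRq|PMC' | tail -40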

> They stay fixed (but only CPU#0 is displayed).
>
> Is perf able to display per-cpu counters, rather than aggregated
> values?
>
> [ 7894.426787] CPU#0: ctrl:       ffffffffffffffff
> [ 7894.426788] CPU#0: status:     0000000000000000
> [ 7894.426790] CPU#0: overflow:   0000000000000000
> [ 7894.426792] CPU#0: fixed:      0000000000000000
> [ 7894.426793] CPU#0: used:       0000000000000000
> [ 7894.426796] CPU#0: gen-PMC0 ctrl:  0000000000134f2e
> [ 7894.426798] CPU#0: gen-PMC0 count: 000000ffb91e31e1
> [ 7894.426799] CPU#0: gen-PMC0 left:  000000007fffffff
> [ 7894.426802] CPU#0: gen-PMC1 ctrl:  000000000013412e
> [ 7894.426804] CPU#0: gen-PMC1 count: 000000ff80312b23
> [ 7894.426805] CPU#0: gen-PMC1 left:  000000007fffffff
> [ 7894.426807] CPU#0: fixed-PMC0 count: 000000ffacf54a68
> [ 7894.426809] CPU#0: fixed-PMC1 count: 000000ffb71cfe02
> [ 7894.426811] CPU#0: fixed-PMC2 count: 0000000000000000
> [ 7905.522262] SysRq : Show Regs
> [ 7905.522277]
> [ 7905.522279] CPU#0: ctrl:       ffffffffffffffff
> [ 7905.522281] CPU#0: status:     0000000000000000
> [ 7905.522283] CPU#0: overflow:   0000000000000000
> [ 7905.522284] CPU#0: fixed:      0000000000000000
> [ 7905.522286] CPU#0: used:       0000000000000000
> [ 7905.522288] CPU#0: gen-PMC0 ctrl:  0000000000134f2e
> [ 7905.522290] CPU#0: gen-PMC0 count: 000000ffb91e31e1
> [ 7905.522292] CPU#0: gen-PMC0 left:  000000007fffffff
> [ 7905.522294] CPU#0: gen-PMC1 ctrl:  000000000013412e
> [ 7905.522296] CPU#0: gen-PMC1 count: 000000ff80312b23
> [ 7905.522298] CPU#0: gen-PMC1 left:  000000007fffffff
> [ 7905.522299] CPU#0: fixed-PMC0 count: 000000ffacf54a68
> [ 7905.522301] CPU#0: fixed-PMC1 count: 000000ffb71cfe02
> [ 7905.522303] CPU#0: fixed-PMC2 count: 0000000000000000
>
> > The counts should stay put (i.e. all counters should be disabled).
> > If they move around - despite there being no 'perf stat -a' session
> > running - that would be a bug.
>
> I rebooted my machine and then got good results:
>
> # perf stat -e instructions -e cycles -a sleep 10
>
>  Performance counter stats for 'sleep':
>
>    240021659058  instructions   (events)
>    240997984836  CPU cycles     (events)   << OK >>
>
>  Wall-clock time elapsed: 10041.499326 msecs
>
> But if I use a plain "perf stat -a sleep 10", it seems I get wrong
> values again (16 G cycles/sec) for all subsequent perf sessions.

I have to ask...  Is it possible that the machine runs at 3GHz
initially, but slows down to 2GHz for cooling reasons?

One thing to try would be to run powertop, which displays the
frequencies.  I get the following if mostly idle:

    PowerTOP version 1.8       (C) 2007 Intel Corporation

    Cn                Avg residency       P-states (frequencies)
    C0 (cpu running)        (14.1%)         2.17 Ghz     4.3%
    C1                0.0ms ( 0.0%)         1.67 Ghz     0.0%
    C2                0.5ms (16.2%)         1333 Mhz     0.0%
    C3                0.5ms (69.8%)         1000 Mhz    95.7%

And the following with an infinite loop running:

    PowerTOP version 1.8       (C) 2007 Intel Corporation

    Cn                Avg residency       P-states (frequencies)
    C0 (cpu running)        (54.3%)         2.17 Ghz   100.0%
    C1                0.0ms ( 0.0%)         1.67 Ghz     0.0%
    C2                1.2ms ( 1.7%)         1333 Mhz     0.0%
    C3                1.3ms (44.0%)         1000 Mhz     0.0%

But I am probably missing the point here...
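
Another way to check, assuming a cpufreq driver is loaded (the sysfs
paths below are the standard cpufreq ones, reporting kHz;
scaling_available_frequencies exists only for some drivers):

    # current frequency of each cpu, and the set the governor may pick from:
    $ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
    $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies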

							Thanx, Paul

> # perf stat -a sleep 10
>
>  Performance counter stats for 'sleep':
>
>    80332.718661  task clock ticks   (msecs)
>           80602  context switches   (events)
>               4  CPU migrations     (events)
>             473  pagefaults         (events)
>    160665397757  CPU cycles         (events)   << bad >>
>    160079277974  instructions       (events)
>          793665  cache references   (events)
>          226829  cache misses       (events)
>
>  Wall-clock time elapsed: 10041.302969 msecs
>
> # perf stat -e cycles -a sleep 10
>
>  Performance counter stats for 'sleep':
>
>    160665176575  CPU cycles   (events)   << bad >>
>
>  Wall-clock time elapsed: 10041.491503 msecs
>
> > Also, the overhead might be profile-able, via:
> >
> >     perf record -m 1024 sleep 10
> >
> > (this records the profile into output.perf)
> >
> > followed by:
> >
> >     ./perf-report | tail -20
> >
> > to display a histogram, with kernel-space and user-space symbols
> > mixed into a single profile.
> >
> > (Pick up latest -tip to get perf-report built by default.)
>
> Thanks, this is what I use currently.
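
For reference, the complete record/report sequence described above in
one place (options exactly as Ingo gives them; the 20-line tail is
arbitrary):

    # record a profile into output.perf, then show the histogram tail:
    perf record -m 1024 sleep 10
    ./perf-report | tail -20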