From: Christian Ehrhardt
Date: Wed, 08 Oct 2008 13:49:21 +0000
Subject: Re: exit timing analysis v1 - comments&discussions welcome
Message-Id: <48ECBA61.20803@linux.vnet.ibm.com>
References: <48DA0747.3020004@linux.vnet.ibm.com>
In-Reply-To: <48DA0747.3020004@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
To: kvm-ppc@vger.kernel.org

Wondering about that 30.5% for postprocessing and
kvmppc_check_and_deliver_interrupts, I quickly checked it in detail -
part d is now divided into 4 subparts.

I also looked at the return-to-guest path to see whether the expected
part (restoring the tlb) is really the main time eater there. The
result clearly shows that it is.

More detailed breakdown:
a)  10.94% - exit, saving guest state (booke_interrupt.S)
b)   8.12% - reaching kvmppc_handle_exit
c)   7.59% - syscall exit is checked and an interrupt is queued using
             kvmppc_queue_exception
d1)  3.33% - some checks for all exits
d2)  8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
d3) 17.20% - can_deliver/clear&deliver exception in
             kvmppc_check_and_deliver_interrupts
d4)  4.47% - updating kvm_stat statistics
e)   6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
f1) 29.18% - restoring guest tlb
f2)  4.69% - restoring guest state ([s]regs)

These fractions are % of our ~12µs syscall exit.
=> restoring the tlb on each reenter = ~4µs constant overhead
=> worth looking a bit into irq delivery and other constant things
   like kvm_stat updating

---

I changed some code in the statistics updating and the interrupt
delivery and got this (note that part d is now split into 5 subparts):

        base   -  impirq (d3) - impstat (d5) - impboth
a)   12.57%  -  11.13%  -  12.05%  -  11.03%   exit, saving guest state (booke_interrupt.S)
b)    7.37%  -   9.38%  -   8.69%  -   8.07%   reaching kvmppc_handle_exit
c)    7.38%  -   7.20%  -   7.49%  -   9.78%   syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
d1)   2.49%  -   3.39%  -   2.56%  -   3.30%   some checks for all exits
d2)   8.84%  -   8.56%  -   9.28%  -   8.31%   finding first bit in kvmppc_check_and_deliver_interrupts
d3)   6.53%  -   5.25%  -   6.63%  -   5.10%   can_deliver in kvmppc_check_and_deliver_interrupts
d4)  13.66%  -  15.37%  -  14.12%  -  14.92%   clear&deliver exception in kvmppc_check_and_deliver_interrupts
d5)   3.65%  -   4.57%  -   2.68%  -   4.41%   updating kvm_stat statistics
e)    6.55%  -   6.30%  -   6.30%  -   5.89%   returning from kvmppc_handle_exit to booke_interrupt.S
f1)  30.90%  -  28.78%  -  30.16%  -  29.16%   restoring guest tlb
f2)   4.81%  -   4.77%  -   5.06%  -   4.66%   restoring guest state ([s]regs)

Well, the improvements (if existent) are too close to the measurement
inaccuracy. Therefore I can't draw any conclusions right now, but at
least the patches seem to work.

I will test them soon with a longer testcase and with a bit of the
measuring overhead removed, to reduce the inaccuracy.

The patch to our kvm_stat counters would make sense for style reasons
alone, so it is worth trying :-)

Now I go for the TLB replacement in f1 - rough sketches of the three
ideas (irq delivery, statistics, tlb restore) follow below.
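For the irq delivery part (d2/d3): the idea behind the impirq change
is simply to bail out before the priority scan when nothing is pending
at all. A minimal sketch of that - kvmppc_check_and_deliver_interrupts
and can_deliver are the names from the breakdown above, the remaining
identifiers (pending_exceptions, kvmppc_deliver_interrupt) are my
assumptions here, not the actual patch:

#include <linux/bitops.h>

static void kvmppc_check_and_deliver_interrupts(struct kvm_vcpu *vcpu)
{
	unsigned long *pending = &vcpu->arch.pending_exceptions;
	unsigned int exc;

	/* Common case on many exits: nothing was queued, skip the scan. */
	if (!*pending)
		return;

	exc = find_first_bit(pending, BITS_PER_LONG);           /* d2 */
	while (exc < BITS_PER_LONG) {
		if (kvmppc_can_deliver_interrupt(vcpu, exc)) {  /* d3 */
			clear_bit(exc, pending);
			kvmppc_deliver_interrupt(vcpu, exc);    /* d4 */
			break;  /* deliver one exception per exit */
		}
		exc = find_next_bit(pending, BITS_PER_LONG, exc + 1);
	}
}

The catch: the syscall exit measured above always queues an exception
in part c, so the early return never triggers for it - which may be
part of why impirq stays within the noise here.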
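On the statistics side (d4/d5 and the tables further down): the
per-exit accounting can stay at a handful of integer ops if all the
derived values are computed only at print time. A sketch of that
scheme - struct and helper names are made up for illustration, the
duration input being a time-base delta taken at the exit boundaries:

#include <linux/types.h>

struct exit_timing {
	u64 count;
	u64 sum;        /* of durations */
	u64 sum_sq;     /* of squared durations, for stddev */
	u64 min;        /* initialize to ~(u64)0 */
	u64 max;
};

static inline void account_exit(struct exit_timing *t, u64 duration)
{
	t->count++;
	t->sum += duration;
	t->sum_sq += duration * duration;
	if (duration < t->min)
		t->min = duration;
	if (duration > t->max)
		t->max = duration;
}

/* Only the sysrq print path pays for division and square root:
 *   avg    = sum / count
 *   stddev = sqrt(sum_sq / count - avg * avg)
 *   time%  = 100 * sum / sumsum
 */

That also means impstat can only ever shave a few adds and compares
off the hot path, which would fit with d5 moving only marginally.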
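And for f1: the ~4µs come from rewriting the complete shadow TLB on
every reenter. The direction Liu suggested (see the quoted thread at
the bottom) is to track which entries the host actually clobbered and
rewrite only those. A sketch of that idea in C - all identifiers here
are hypothetical, the real code lives in booke_interrupt.S:

#include <linux/bitmap.h>
#include <linux/bitops.h>

#define GUEST_TLB_SIZE 64       /* 440 shadow TLB entries */

static void kvmppc_restore_guest_tlb(struct kvm_vcpu *vcpu)
{
	unsigned long *dirty = vcpu->arch.shadow_tlb_dirty;
	unsigned int i;

	/* Rewrite only the entries modified while we were in the host,
	 * instead of all 64 on every guest entry. */
	for (i = find_first_bit(dirty, GUEST_TLB_SIZE);
	     i < GUEST_TLB_SIZE;
	     i = find_next_bit(dirty, GUEST_TLB_SIZE, i + 1))
		kvmppc_44x_tlbwe(i, &vcpu->arch.shadow_tlb[i]);

	bitmap_zero(dirty, GUEST_TLB_SIZE);
}

Whether scanning the dirty bitmap on every entry pays off depends on
how often the host touches the TLB between exits - exactly the cost
Liu warns about below ("Examining the modify array may cost some
time").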
Christian

Christian Ehrhardt wrote:
> As intended, I separated the time of one of our exits into its
> segments to see where we lose the constant base time.
> I looked at the syscall exit, which is pretty fast and does not
> deviate very much.
>
> A little tracing and debugging later I can now show the rough
> breakdown of where the time is spent for that exit.
>
> avg time spent in the segments I checked:
> a) 11.5% - exit, saving guest state (booke_interrupt.S)
> b)  8.5% - reaching kvmppc_handle_exit
> c)   10% - syscall exit is checked and an interrupt is queued using
>            kvmppc_queue_exception
> d) 30.5% - postprocessing for all exits =
>            kvmppc_check_and_deliver_interrupts and updating some statistics
> e)  5.5% - returning from kvmppc_handle_exit to booke_interrupt.S
> f)   34% - restoring guest state tlb/regs (booke_interrupt.S)
>
> Ok, looking at these numbers I think we have two areas to check in
> detail:
> - On the one hand, the postprocessing part (d) really should not take
>   that much time.
> - On the other hand, we might be able to improve the last segment by
>   changing our tlb management (Liu made a suggestion on the list two
>   weeks ago).
>
> Thinking about this breakdown with the complete timing in mind (see
> below) shows that the average time spent in one of our most frequent
> exits is near the minimum duration (EMUL). This means that if we can
> reduce some of that constant overhead, it might really help us in
> those frequent and hot paths (the virt tlb misses will benefit from
> that as well).
>
> I'll continue on that and keep you all updated.
> Christian
>
> Christian Ehrhardt wrote:
>> Today I refactored my code in a way that it might eventually go
>> upstream.
>> That means I removed some debug blocks, made it configurable, and
>> print output only on request per sysrq.
>> Feel free to review that code now, or wait for my separate RFC post
>> that will follow ~next week.
>>
>> Additionally, I realized in a discussion that my host system was still
>> configured in demo mode (X11+xfce, dhcpd, avahi, ... running in the
>> host).
>> I took some updated numbers without all those background activities -
>> the testcase is again boot, login, ls, kill.
>>
>> I intend to look at the irq path in detail as suggested, and first of
>> all I'll try to get some timing of the "segments" of such a call.
>> This should show us where we lose our "constant base time" and give
>> me some hints for looking at it in detail.
>>
>> FYI - new numbers without high background load (times in µs,
>> monospace recommended):
>>
>> NOPV:
>>              count  min     max        sum       avg   stddev  time%
>> MMIO         10105   46    1534    2952467  292.1788  295.933   2.04
>> DCR           5428   40     209     246168   45.3515    6.758   0.17
>> SIGNAL         695   65    3755      89767  129.1612  314.421   0.06
>> ITLBREAL     80045   13     108    1063127   13.2816    2.338   0.73
>> ITLBVIRT   1000585   21  264827   24300596   24.2864  264.753  16.78
>> DTLBREAL     91206   13      69    1285214   14.0913    2.225   0.89
>> DTLBVIRT    977434   21    1446   24007008   24.5613    4.426  16.58
>> SYSCALL      10460   11      55     116447   11.1326    1.929   0.08
>> ISI          11724   11      61     130007   11.0890    1.929   0.09
>> DSI          20737   11      57     230009   11.0917    1.914   0.16
>> EMULINST   5683356   11    3778   79339467   13.9600   50.275  54.78
>> DEC          13079   50     826     732712   56.0220   22.382   0.51
>> EXTINT          55   30    1478      10996  199.9273  238.150   0.01
>> FP_UNAVAIL     280   11      53       3163   11.2964    3.495   0.00
>> TIMEINGUEST 7905189    0    3688   10330742    1.3068    8.970   7.13
>> csum      15810378
>> sumsum   144837890  => ~2:24 runtime
>> sumavg     7241894
>>
>> PV:
>>              count  min     max        sum       avg   stddev  time%
>> MMIO         12505   46    3087    3693782  295.3844  260.788   4.01
>> DCR           5595   40     706     273578   48.8969   31.305   0.30
>> SIGNAL         654   65    4132     300027  458.7569  571.130   0.33
>> ITLBREAL     71711   13     104     943053   13.1507    2.360   1.02
>> ITLBVIRT    750649   21    1503   18178245   24.2167    7.335  19.71
>> DTLBREAL     83356   13     102    1146242   13.7512    2.406   1.24
>> DTLBPV       30086   20     237     653556   21.7229    4.639   0.71
>> DTLBVIRT    772811   21     713   19079477   24.6884    6.593  20.69
>> SYSCALL       7647   11      57      84821   11.0921    1.897   0.09
>> HCALL            1   19      19         19   19.0000    0.000   0.00
>> ISI           9895   11      73     109667   11.0831    1.904   0.12
>> DSI          17974   10      57     199504   11.0996    2.046   0.22
>> EMULINST   2567245   11    4212   40501314   15.7762   65.673  43.92
>> DEC           7488   51     641     426813   56.9996   23.893   0.46
>> EXTINT        2215   31    1677     297495  134.3093  116.219   0.32
>> FP_UNAVAIL     258   11      11       2838   11.0000    0.000   0.00
>> TIMEINGUEST 4340090    0    3850    6316079    1.4553   12.599   6.85
>> csum       8680180
>> sumsum    92206510  => ~1:32 runtime
>> sumavg     4610325
>>
>> Hollis Blanchard wrote:
>>> On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
>>>>> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
>>>>>>              count  min     max         sum      avg   stddev  time%
>>>>>> EMULINST 6,423,699   12 247,582  91,732,705  14.2804  241.200  45.36
>>>>>> ITLBVIRT 1,777,242   21 264,257  47,116,557  26.5111  286.040  23.30
>>>>>> DTLBVIRT 1,544,241   22 263,947  41,765,447  27.0459  218.997  20.65
>>>>>
>>>>> (The max here is of course when the guest was de-scheduled in
>>>>> favor of another host process.)
>>>>>
>>>>> I think it's interesting that the min instruction emulation time
>>>>> is 12 usecs. In fact, our "null" exits where we do almost no
>>>>> processing (ISI, DSI, syscall) are 11 usecs, which suggests we
>>>>> have a problem with interrupt handler overhead (for all exit
>>>>> types).
>>>>
>>>> Will you consider moving the tlb manipulation out of the entering
>>>> path? Examining the modify array may cost some time.
>>>
>>> Yes, definitely.
>>>
>>> That's about the only thing I can see in the 440 interrupt path that
>>> might take significant time. Unfortunately we can't profile that code
>>> because we have no performance counters, so finding the problem will
>>> require some experimentation.

--
Grüsse / regards,
Christian Ehrhardt

IBM Linux Technology Center, Open Virtualization