From: Christian Ehrhardt
Date: Wed, 08 Oct 2008 13:49:21 +0000
Subject: Re: exit timing analysis v1 - comments&discussions welcome
Message-Id: <48ECBA61.20803@linux.vnet.ibm.com>
References: <48DA0747.3020004@linux.vnet.ibm.com>
In-Reply-To: <48DA0747.3020004@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
To: kvm-ppc@vger.kernel.org

Wondering about that 30.5% for postprocessing and
kvmppc_check_and_deliver_interrupts, I quickly checked it in detail -
part d is now divided into 4 subparts.

I also looked at the return-to-guest path to see whether the expected
part (restoring the tlb) is really the main time eater there. The
result clearly shows that it is.

More detailed breakdown:
a)  10.94% - exit, saving guest state (booke_interrupt.S)
b)   8.12% - reaching kvmppc_handle_exit
c)   7.59% - syscall exit is checked and an interrupt is queued using
             kvmppc_queue_exception
d1)  3.33% - some checks for all exits
d2)  8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
d3) 17.20% - can_deliver/clear&deliver exception in
             kvmppc_check_and_deliver_interrupts
d4)  4.47% - updating kvm_stat statistics
e)   6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
f1) 29.18% - restoring guest tlb
f2)  4.69% - restoring guest state ([s]regs)

These fractions are % of our ~12µs syscall exit.
=> restoring the tlb on each reenter = ~4µs constant overhead
=> worth looking a bit into irq delivery and other constant things
   like kvm_stat updating

---

I changed some code in the statistics updating and the interrupt
delivery and got this (note that part d is now split into 5 subparts):

        base   -  impirq (d3) - impstat (d5) - impboth
a)   12.57%  -  11.13%  -  12.05%  -  11.03%   exit, saving guest state (booke_interrupt.S)
b)    7.37%  -   9.38%  -   8.69%  -   8.07%   reaching kvmppc_handle_exit
c)    7.38%  -   7.20%  -   7.49%  -   9.78%   syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
d1)   2.49%  -   3.39%  -   2.56%  -   3.30%   some checks for all exits
d2)   8.84%  -   8.56%  -   9.28%  -   8.31%   finding first bit in kvmppc_check_and_deliver_interrupts
d3)   6.53%  -   5.25%  -   6.63%  -   5.10%   can_deliver in kvmppc_check_and_deliver_interrupts
d4)  13.66%  -  15.37%  -  14.12%  -  14.92%   clear&deliver exception in kvmppc_check_and_deliver_interrupts
d5)   3.65%  -   4.57%  -   2.68%  -   4.41%   updating kvm_stat statistics
e)    6.55%  -   6.30%  -   6.30%  -   5.89%   returning from kvmppc_handle_exit to booke_interrupt.S
f1)  30.90%  -  28.78%  -  30.16%  -  29.16%   restoring guest tlb
f2)   4.81%  -   4.77%  -   5.06%  -   4.66%   restoring guest state ([s]regs)

Well, the improvements (if existent) are too close to the measurement
inaccuracy. Therefore I can't draw any conclusions right now, but at
least the patches seem to work.

I will test them soon with a longer testcase and with a bit of the
measuring overhead removed, to reduce the inaccuracy.

The patch to our kvm_stat counters would make sense for style reasons
alone, so it is worth trying :-)

Now I go for the TLB replacement in f1 - rough sketches of the three
ideas (irq delivery, statistics, tlb restore) follow below.
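For the irq delivery part (d2/d3): the idea behind the impirq change
is simply to bail out before the priority scan when nothing is pending
at all. A minimal sketch of that - kvmppc_check_and_deliver_interrupts
and can_deliver are the names from the breakdown above, the remaining
identifiers (pending_exceptions, kvmppc_deliver_interrupt) are my
assumptions here, not the actual patch:

#include <linux/bitops.h>

static void kvmppc_check_and_deliver_interrupts(struct kvm_vcpu *vcpu)
{
	unsigned long *pending = &vcpu->arch.pending_exceptions;
	unsigned int exc;

	/* Common case on many exits: nothing was queued, skip the scan. */
	if (!*pending)
		return;

	exc = find_first_bit(pending, BITS_PER_LONG);           /* d2 */
	while (exc < BITS_PER_LONG) {
		if (kvmppc_can_deliver_interrupt(vcpu, exc)) {  /* d3 */
			clear_bit(exc, pending);
			kvmppc_deliver_interrupt(vcpu, exc);    /* d4 */
			break;  /* deliver one exception per exit */
		}
		exc = find_next_bit(pending, BITS_PER_LONG, exc + 1);
	}
}

The catch: the syscall exit measured above always queues an exception
in part c, so the early return never triggers for it - which may be
part of why impirq stays within the noise here.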
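On the statistics side (d4/d5 and the tables further down): the
per-exit accounting can stay at a handful of integer ops if all the
derived values are computed only at print time. A sketch of that
scheme - struct and helper names are made up for illustration, the
duration input being a time-base delta taken at the exit boundaries:

#include <linux/types.h>

struct exit_timing {
	u64 count;
	u64 sum;        /* of durations */
	u64 sum_sq;     /* of squared durations, for stddev */
	u64 min;        /* initialize to ~(u64)0 */
	u64 max;
};

static inline void account_exit(struct exit_timing *t, u64 duration)
{
	t->count++;
	t->sum += duration;
	t->sum_sq += duration * duration;
	if (duration < t->min)
		t->min = duration;
	if (duration > t->max)
		t->max = duration;
}

/* Only the sysrq print path pays for division and square root:
 *   avg    = sum / count
 *   stddev = sqrt(sum_sq / count - avg * avg)
 *   time%  = 100 * sum / sumsum
 */

That also means impstat can only ever shave a few adds and compares
off the hot path, which would fit with d5 moving only marginally.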
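And for f1: the ~4µs come from rewriting the complete shadow TLB on
every reenter. The direction Liu suggested (see the quoted thread at
the bottom) is to track which entries the host actually clobbered and
rewrite only those. A sketch of that idea in C - all identifiers here
are hypothetical, the real code lives in booke_interrupt.S:

#include <linux/bitmap.h>
#include <linux/bitops.h>

#define GUEST_TLB_SIZE 64       /* 440 shadow TLB entries */

static void kvmppc_restore_guest_tlb(struct kvm_vcpu *vcpu)
{
	unsigned long *dirty = vcpu->arch.shadow_tlb_dirty;
	unsigned int i;

	/* Rewrite only the entries modified while we were in the host,
	 * instead of all 64 on every guest entry. */
	for (i = find_first_bit(dirty, GUEST_TLB_SIZE);
	     i < GUEST_TLB_SIZE;
	     i = find_next_bit(dirty, GUEST_TLB_SIZE, i + 1))
		kvmppc_44x_tlbwe(i, &vcpu->arch.shadow_tlb[i]);

	bitmap_zero(dirty, GUEST_TLB_SIZE);
}

Whether scanning the dirty bitmap on every entry pays off depends on
how often the host touches the TLB between exits - exactly the cost
Liu warns about below ("Examining the modify array may cost some
time").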
Christian

Christian Ehrhardt wrote:
> As intended, I separated the time of one of our exits into its
> segments to see where we lose the constant base time.
> I looked at the syscall exit, which is pretty fast and does not
> deviate very much.
>
> A little tracing and debugging later I can now show the rough
> breakdown of where the time is spent for that exit.
>
> avg time spent in the segments I checked:
> a) 11.5% - exit, saving guest state (booke_interrupt.S)
> b)  8.5% - reaching kvmppc_handle_exit
> c)   10% - syscall exit is checked and an interrupt is queued using
>            kvmppc_queue_exception
> d) 30.5% - postprocessing for all exits =
>            kvmppc_check_and_deliver_interrupts and updating some statistics
> e)  5.5% - returning from kvmppc_handle_exit to booke_interrupt.S
> f)   34% - restoring guest state tlb/regs (booke_interrupt.S)
>
> Ok, looking at these numbers I think we have two areas to check in
> detail:
> - On the one hand, the postprocessing part (d) really should not take
>   that much time.
> - On the other hand, we might be able to improve the last segment by
>   changing our tlb management (Liu made a suggestion on the list two
>   weeks ago).
>
> Thinking about this breakdown with the complete timing in mind (see
> below) shows that the average time spent in one of our most frequent
> exits is near the minimum duration (EMUL). This means that if we can
> reduce some of that constant overhead, it might really help us in
> those frequent and hot paths (the virt tlb misses will benefit from
> that as well).
>
> I'll continue on that and keep you all updated.
> Christian
>
> Christian Ehrhardt wrote:
>> Today I refactored my code in a way that it might eventually go
>> upstream.
>> That means I removed some debug blocks, made it configurable, and
>> print output only on request per sysrq.
>> Feel free to review that code now, or wait for my separate RFC post
>> that will follow ~next week.
>>
>> Additionally, I realized in a discussion that my host system was still
>> configured in demo mode (X11+xfce, dhcpd, avahi, ... running in the
>> host).
>> I took some updated numbers without all those background activities -
>> the testcase is again boot, login, ls, kill.
>>
>> I intend to look at the irq path in detail as suggested, and first of
>> all I'll try to get some timing of the "segments" of such a call.
>> This should show us where we lose our "constant base time" and give
>> me some hints for looking at it in detail.
>>
>> FYI - new numbers without high background load (times in µs,
>> monospace recommended):
>>
>> NOPV:
>>              count  min     max        sum       avg   stddev  time%
>> MMIO         10105   46    1534    2952467  292.1788  295.933   2.04
>> DCR           5428   40     209     246168   45.3515    6.758   0.17
>> SIGNAL         695   65    3755      89767  129.1612  314.421   0.06
>> ITLBREAL     80045   13     108    1063127   13.2816    2.338   0.73
>> ITLBVIRT   1000585   21  264827   24300596   24.2864  264.753  16.78
>> DTLBREAL     91206   13      69    1285214   14.0913    2.225   0.89
>> DTLBVIRT    977434   21    1446   24007008   24.5613    4.426  16.58
>> SYSCALL      10460   11      55     116447   11.1326    1.929   0.08
>> ISI          11724   11      61     130007   11.0890    1.929   0.09
>> DSI          20737   11      57     230009   11.0917    1.914   0.16
>> EMULINST   5683356   11    3778   79339467   13.9600   50.275  54.78
>> DEC          13079   50     826     732712   56.0220   22.382   0.51
>> EXTINT          55   30    1478      10996  199.9273  238.150   0.01
>> FP_UNAVAIL     280   11      53       3163   11.2964    3.495   0.00
>> TIMEINGUEST 7905189    0    3688   10330742    1.3068    8.970   7.13
>> csum      15810378
>> sumsum   144837890  => ~2:24 runtime
>> sumavg     7241894
>>
>> PV:
>>              count  min     max        sum       avg   stddev  time%
>> MMIO         12505   46    3087    3693782  295.3844  260.788   4.01
>> DCR           5595   40     706     273578   48.8969   31.305   0.30
>> SIGNAL         654   65    4132     300027  458.7569  571.130   0.33
>> ITLBREAL     71711   13     104     943053   13.1507    2.360   1.02
>> ITLBVIRT    750649   21    1503   18178245   24.2167    7.335  19.71
>> DTLBREAL     83356   13     102    1146242   13.7512    2.406   1.24
>> DTLBPV       30086   20     237     653556   21.7229    4.639   0.71
>> DTLBVIRT    772811   21     713   19079477   24.6884    6.593  20.69
>> SYSCALL       7647   11      57      84821   11.0921    1.897   0.09
>> HCALL            1   19      19         19   19.0000    0.000   0.00
>> ISI           9895   11      73     109667   11.0831    1.904   0.12
>> DSI          17974   10      57     199504   11.0996    2.046   0.22
>> EMULINST   2567245   11    4212   40501314   15.7762   65.673  43.92
>> DEC           7488   51     641     426813   56.9996   23.893   0.46
>> EXTINT        2215   31    1677     297495  134.3093  116.219   0.32
>> FP_UNAVAIL     258   11      11       2838   11.0000    0.000   0.00
>> TIMEINGUEST 4340090    0    3850    6316079    1.4553   12.599   6.85
>> csum       8680180
>> sumsum    92206510  => ~1:32 runtime
>> sumavg     4610325
>>
>> Hollis Blanchard wrote:
>>> On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
>>>>> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
>>>>>>              count  min     max         sum      avg   stddev  time%
>>>>>> EMULINST 6,423,699   12 247,582  91,732,705  14.2804  241.200  45.36
>>>>>> ITLBVIRT 1,777,242   21 264,257  47,116,557  26.5111  286.040  23.30
>>>>>> DTLBVIRT 1,544,241   22 263,947  41,765,447  27.0459  218.997  20.65
>>>>>
>>>>> (The max here is of course when the guest was de-scheduled in
>>>>> favor of another host process.)
>>>>>
>>>>> I think it's interesting that the min instruction emulation time
>>>>> is 12 usecs. In fact, our "null" exits where we do almost no
>>>>> processing (ISI, DSI, syscall) are 11 usecs, which suggests we
>>>>> have a problem with interrupt handler overhead (for all exit
>>>>> types).
>>>>
>>>> Will you consider moving the tlb manipulation out of the entering
>>>> path? Examining the modify array may cost some time.
>>>
>>> Yes, definitely.
>>>
>>> That's about the only thing I can see in the 440 interrupt path that
>>> might take significant time. Unfortunately we can't profile that code
>>> because we have no performance counters, so finding the problem will
>>> require some experimentation.

--
Grüsse / regards,
Christian Ehrhardt

IBM Linux Technology Center, Open Virtualization